INDEX
Explanations
phrases related to social justice and activism
instances of the end-of-text token
Negative Logits
hindsight       -0.94
quir            -0.90
glitch          -0.80
fuzz            -0.79
quirks          -0.78
glitches        -0.78
abase           -0.77
accidentally    -0.75
detecting       -0.74
clust           -0.74
Positive Logits
Amen            1.25
Peace           1.19
Quran           1.08
peace           0.99
Therefore       0.98
å¿              0.92
ðŁ              0.89
Peace           0.89
Pope            0.89
âĢķ             0.88
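
Three of the positive-logit tokens (å¿, ðŁ, âĢķ) look like encoding debris but are most likely raw vocabulary entries from a byte-level BPE tokenizer. The sketch below assumes the model uses the GPT-2 style byte-to-character table, which this page does not confirm. The helper names `token_to_bytes` and `describe` are hypothetical and exist only for illustration. It maps each displayed character back to its byte and decodes the result as UTF-8 when the bytes form a complete character.

```python
def bytes_to_unicode():
    """Byte -> printable character table used by GPT-2-style byte-level BPE."""
    # Bytes that are already printable map to themselves.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # Remaining bytes (control chars, space, 0x7F-0xA0, 0xAD) are shifted to 256+.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


# Inverse table: displayed character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def token_to_bytes(token: str) -> bytes:
    """Hypothetical helper: recover the raw bytes behind a displayed token."""
    return bytes(UNICODE_TO_BYTE[ch] for ch in token)


def describe(token: str) -> str:
    """Hypothetical helper: decode a token, or report a partial UTF-8 sequence."""
    raw = token_to_bytes(token)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # The token is only a prefix of a multi-byte character.
        return f"<partial UTF-8: {raw.hex(' ')}>"


for tok in ["âĢķ", "ðŁ", "å¿"]:
    print(tok, "->", describe(tok))
```

Under that assumption, âĢķ is a complete three-byte sequence, while ðŁ and å¿ are only the leading bytes of longer characters (a four-byte sequence such as an emoji and a three-byte CJK character, respectively), so they cannot be rendered on their own.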
Activations Density 0.477%