INDEX
Explanations
punctuation, particularly periods
Negative Logits
eç      -0.16
675     -0.16
odia    -0.15
INST    -0.15
otre    -0.15
ewan    -0.15
esel    -0.15
aoke    -0.14
fter    -0.14
zdy     -0.14
Positive Logits
лим     0.16
σαν     0.14
Bağ     0.13
メラ    0.13
leDb    0.13
Alf     0.13
exhaust 0.13
íĻ      0.13
216     0.13
ρια     0.13
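Token lists like the two above are commonly produced by projecting a feature's output direction through the model's unembedding matrix and ranking tokens by the resulting value. The page does not say how these numbers were computed, so the sketch below only assumes that procedure. The names `feature_direction`, `unembed`, and `vocab` are hypothetical, and the data is a toy example rather than values from this feature.

```python
def logit_effects(feature_direction, unembed):
    """Dot the feature direction with each token's unembedding row.

    Assumption: `unembed` holds one row of weights per vocabulary token.
    """
    return [
        sum(d * w for d, w in zip(feature_direction, row))
        for row in unembed
    ]


def top_and_bottom(effects, vocab, k=2):
    """Return the k most negative and k most positive (token, value) pairs."""
    ranked = sorted(zip(vocab, effects), key=lambda pair: pair[1])
    negative = ranked[:k]          # most negative first
    positive = ranked[::-1][:k]    # most positive first
    return negative, positive


# Toy data (hypothetical): four tokens, two-dimensional feature direction.
vocab = ["a", "b", "c", "d"]
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]
feature_direction = [0.2, -0.1]

effects = logit_effects(feature_direction, unembed)
negative, positive = top_and_bottom(effects, vocab, k=2)
negative_tokens = [token for token, _ in negative]
positive_tokens = [token for token, _ in positive]
print("negative:", negative_tokens)
print("positive:", positive_tokens)
```

Values this small (all magnitudes at or below 0.16 here) suggest the feature has only a weak direct effect on any single output token under this reading.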
Activations Density 0.030%
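A density figure like this is usually read as the fraction of sampled tokens on which the feature fires at all. The sketch below assumes that definition, which the page itself does not spell out. The function name `activation_density` and the toy activations are hypothetical. Under this reading, 0.030% corresponds to roughly 3 active tokens per 10,000.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens with a nonzero
    (positive) activation; the dashboard may define it differently.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy sample (hypothetical): 3 active tokens out of 10,000.
toy_activations = [0.0] * 9997 + [1.2, 0.4, 3.1]
density = activation_density(toy_activations)
print(f"{density:.3%}")  # prints 0.030%
```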