Explanations
uppercase letters in the text
Negative Logits
mlin    -0.17
ru      -0.17
yla     -0.16
ира     -0.16
io      -0.15
ully    -0.15
ullet   -0.15
yy      -0.15
ern     -0.15
gnore   -0.15
Positive Logits
em      0.18
hum     0.17
bread   0.17
pard    0.17
oga     0.17
ex      0.16
KM      0.15
emm     0.15
los     0.15
rix     0.15
Activations Density 0.172%