Explanations
affirmative statements or confirmations
Negative Logits
Token    Logit
emer     -0.17
asley    -0.17
chas     -0.15
ont      -0.15
emoc     -0.15
amage    -0.15
ванов    -0.14
iko      -0.14
Hale     -0.14
stand    -0.14
Positive Logits
Token    Logit
atty     0.16
arth     0.16
forth    0.15
orable   0.15
éĹ       0.15
γει      0.15
quant    0.14
ÑĢаÑħ    0.14
anton    0.14
leck     0.14
Activations Density 0.047%