INDEX
Explanations
negative or contradictory language
Negative Logits
ACHI    -0.16
ah      -0.15
ahr     -0.15
Ñĩи     -0.14
Spit    -0.13
ç¨      -0.13
atte    -0.13
Ãij     -0.13
iver    -0.13
lan     -0.13
Positive Logits
ãĥ³ãĥĨãĤ£    0.17
ODB     0.16
iron    0.16
gba     0.15
XB      0.14
kup     0.14
GBK     0.14
bomb    0.14
ánu     0.14
IRON    0.14
Activations Density: 0.008%