Explanations
statements indicating actions or opinions expressed by individuals
Negative Logits
Token     Logit
ignum     -0.16
виÑĤ      -0.14
illy      -0.14
awe       -0.14
anz       -0.13
irsch     -0.13
ignKey    -0.13
ins       -0.13
orrh      -0.13
_TER      -0.13
Positive Logits
Token     Logit
roke      0.15
Zot       0.15
Äįin      0.14
vign      0.14
inx       0.14
zilla     0.14
rador     0.14
ROKE      0.13
alar      0.13
421       0.13
Activations Density: 0.089%