INDEX
Explanations
expressions of criticism and negative sentiment towards individuals or groups
Negative Logits
omu       -0.16
Nar       -0.15
856       -0.15
otland    -0.15
ador      -0.15
mou       -0.15
oen       -0.14
iali      -0.14
uka       -0.14
Bale      -0.14
Positive Logits
inh        0.16
Thur       0.14
жи         0.14
á»Ĩ        0.14
itecture   0.14
ÏĦο        0.14
.disk      0.14
ISO        0.14
elt        0.14
Exited     0.14
Activations Density: 0.408%