INDEX
Explanations
words related to feelings of contempt or disrespect towards authority or specific groups
Negative Logits
ramid            -0.69
hemor            -0.68
Lans             -0.65
NetMessage       -0.64
encyclopedia     -0.63
toget            -0.61
akeru            -0.60
reconstruction   -0.60
stabilization    -0.60
advoc            -0.60
Positive Logits
uously           1.55
uous             1.52
fully            1.20
ible             1.18
ibly             1.09
ful              1.04
ateurs           1.02
ardless          1.01
urous            1.00
orable           0.99
Activations Density 0.041%