INDEX
Explanations
This neuron lights up on aggressive or insulting language that labels people or groups as enemies or worthy of vilification.
New Auto-Interp
Negative Logits
Ậ
-0.07
císa
-0.06
بوب
-0.06
187
-0.06
graphical
-0.06
одатель
-0.06
fingert
-0.06
िच
-0.06
.Logf
-0.06
defender
-0.06
POSITIVE LOGITS
(exc
0.06
]|[
0.06
looph
0.06
_sys
0.06
canlı
0.06
глу
0.06
Column
0.06
Basically
0.06
::__
0.06
akin
0.05
Activations Density 0.254%