INDEX
Explanations
The neuron activates on mentions of anger, aggression, or emotional hostility (e.g. “anger,” “angry,” “aggression”).
Negative Logits
ebook       -0.07
follic      -0.07
            -0.07
702         -0.07
touted      -0.07
نو          -0.06
experience  -0.06
coincide    -0.06
44          -0.06
15          -0.06
Positive Logits
anger       0.12
angry       0.10
rage        0.07
Angry       0.07
άλ          0.07
ující       0.07
_KERNEL     0.07
.opacity    0.06
PY          0.06
geniş       0.06
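
The dashboard does not say how the logit lists above are produced. A common approach, assumed here, is to project the neuron's output direction through the model's unembedding matrix and rank the vocabulary by the resulting per-token effect. The sketch below illustrates that idea on toy data; `top_logit_effects`, `w_out`, `W_U`, and the four-token vocabulary are hypothetical names and values, not taken from the model behind this page.

```python
import numpy as np

def top_logit_effects(w_out, W_U, vocab, k=3):
    """Rank tokens by the logit change a unit activation of the neuron would cause.

    w_out: (d_model,) output direction of the neuron (assumed input)
    W_U:   (d_model, vocab_size) unembedding matrix (assumed input)
    vocab: list of vocab_size token strings
    Returns (positive, negative), each a list of (token, effect) pairs.
    """
    effects = w_out @ W_U                 # shape: (vocab_size,)
    order = np.argsort(effects)           # ascending by effect
    negative = [(vocab[i], float(effects[i])) for i in order[:k]]
    positive = [(vocab[i], float(effects[i])) for i in order[::-1][:k]]
    return positive, negative

# Toy example: 2-dimensional residual stream, 4-token vocabulary (made-up numbers).
vocab = ["anger", "angry", "ebook", "touted"]
W_U = np.array([[0.5, 0.3, -0.4, -0.2],
                [0.0, 0.0,  0.0,  0.0]])
w_out = np.array([1.0, 0.0])

positive, negative = top_logit_effects(w_out, W_U, vocab, k=2)
print(positive)  # tokens promoted most: anger, then angry
print(negative)  # tokens suppressed most: ebook, then touted
```

Under this assumption, the positive list names the tokens the neuron pushes the model toward and the negative list names the tokens it pushes away from. The small magnitudes in the tables above suggest the effect on any single token is weak.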
Activations Density 0.007%
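
The density figure is not defined on this page. A minimal sketch, assuming it means the fraction of sampled tokens on which the neuron's activation exceeds zero, is given below. The function name `activation_density`, the `threshold` parameter, and the toy counts (chosen only to reproduce the 0.007% figure) are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds `threshold` (assumed definition)."""
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)

# Toy data: 100,000 tokens, 7 of which activate the neuron.
activations = [0.0] * 99_993 + [1.5] * 7
density = activation_density(activations)
print(f"{density:.3%}")  # 0.007%
```

A density this low would mean the neuron fires on roughly 7 tokens in every 100,000, which fits a narrow, topic-specific neuron.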