INDEX
Explanations
The neuron fires on words that denote cruelty, violence, or sadistic infliction of pain.
New Auto-Interp
Negative Logits
Florian
-0.07
Transportation
-0.07
Matte
-0.07
Doug
-0.06
still
-0.06
phi
-0.06
]))
-0.06
"fmt
-0.06
descendant
-0.06
incontr
-0.06
POSITIVE LOGITS
vý
0.08
видов
0.08
инт
0.07
Sab
0.07
AB
0.07
mim
0.07
Guid
0.06
151
0.06
MEM
0.06
mak
0.06
Activations Density 0.008%