Explanations
The neuron responds to occurrences of the word “ban” (i.e. calls for prohibitions).
Negative Logits
Coeff         -0.08
thorough      -0.08
heart         -0.08
粗            -0.07
Elliot        -0.07
Heart         -0.07
Wood          -0.07
Cardio        -0.07
uco           -0.07
Coefficient   -0.07
Positive Logits
ban           0.14
Ban           0.12
banned        0.12
banning       0.10
Ban           0.10
bans          0.09
AN            0.08
raids         0.08
nam           0.07
outlaw        0.07
Activations Density 0.007%