Explanations
The neuron fires on tokens in polite self-commitment phrases, especially offers of assistance such as “I’ll do my best to help you” or “I’ll do my best to assist you.”
Negative Logits
sexo  -0.07
lying  -0.07
قابل  -0.07
خی  -0.07
(pt  -0.06
_stderr  -0.06
vitro  -0.06
صح  -0.06
původ  -0.06
while  -0.06
Positive Logits
Southwest  0.07
.toolStripButton  0.06
gerald  0.06
underestimated  0.06
PRI  0.06
istra  0.06
MOT  0.06
讲  0.06
NORTH  0.06
BASE  0.06
Activations Density: 0.007%