Explanations
This neuron fires on tokens in the model’s own generated (assistant) responses rather than on user or system prompt text.
Negative Logits
Token       Logit
_birth      -0.07
Flesh       -0.07
Pat         -0.06
maxY        -0.06
jections    -0.06
milf        -0.06
KA          -0.06
filename    -0.06
faker       -0.06
.Perform    -0.06
Positive Logits
Token       Logit
(disposing  0.06
=           0.06
_SELECTED   0.06
.dsl        0.06
-counter    0.06
驾          0.06
perder      0.06
nạn         0.06
hayvan      0.06
komm        0.06
Activations Density: 0.056%
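As a rough illustration of how a density figure like this could be read, the sketch below assumes that "activations density" means the fraction of sampled tokens on which the neuron's activation exceeds a threshold (zero here). That definition, the function name `activation_density`, and the sample values are all assumptions made for illustration; the page itself does not state how the number is computed.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: density = (number of active tokens) / (total tokens sampled).
    This is a hypothetical helper, not the dashboard's own code.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Made-up sample: 56 active tokens out of 100,000 gives a density of 0.056%.
sample = [1.0] * 56 + [0.0] * (100_000 - 56)
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.056% would mean the neuron is active on roughly 56 of every 100,000 tokens, which fits a sparse unit that responds to a narrow context.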