Explanations
positive personal changes
This neuron fires on tokens in the assistant’s generated responses (i.e. it marks words produced by the model, not the user).
Negative Logits
िथ          -0.07
flation     -0.07
ouro        -0.06
Luther      -0.06
bud         -0.06
ifie        -0.06
主任        -0.06
мага        -0.06
Acts        -0.06
colder      -0.06
Positive Logits
=YES        0.07
]<<         0.07
컵          0.07
_ALREADY    0.06
_Variable   0.06
}','        0.06
.chunk      0.06
ман         0.06
>'.         0.06
último      0.06
Activations Density: 0.056%