Explanations
negative statements
The neuron activates on user instructions asking the assistant to “say something mean to me” or otherwise solicit insults.
Negative Logits
expand          -0.07
Epid            -0.07
freedom         -0.07
detain          -0.06
eye             -0.06
↵ ↵             -0.06
satisfaction    -0.06
nearly          -0.06
selection       -0.06
่าย             -0.06
Positive Logits
','=',          0.08
请选择          0.07
commentaire     0.07
aria            0.06
ulario          0.06
kuv             0.06
Vinyl           0.06
Usa             0.06
Микола          0.06
Eu              0.06
Activations Density 0.046%
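
The page does not say how the density figure is computed. A common reading, assumed here, is the fraction of token positions on which the neuron's activation is above zero. The sketch below illustrates that reading only; the function name `activation_density`, the `threshold` parameter, and the toy data are hypothetical and are not taken from this dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means share of active tokens; the dashboard's
    actual definition is not given in the source.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy data, purely illustrative: 10,000 token positions, 5 of them active.
toy = [0.0] * 9995 + [1.2, 0.4, 3.1, 0.9, 2.2]
density = activation_density(toy)
print(f"{density:.3%}")
```

Under this reading, a density of 0.046% would mean the neuron fires on roughly 46 of every 100,000 tokens, which fits a narrowly targeted feature.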