INDEX
Explanations
This neuron activates on the special token that marks the assistant speaker role.
Negative Logits
]↵    -0.06
Radi    -0.06
radi    -0.06
")]↵    -0.06
حمل    -0.06
یط    -0.05
/usr    -0.05
SON    -0.05
IAM    -0.05
!"↵    -0.05
Positive Logits
ộn    0.07
ْع    0.07
Advice    0.07
이어    0.07
_converter    0.07
Species    0.07
TASK    0.07
moyen    0.07
ưỡng    0.07
resembled    0.06
Activations Density 0.061%