Explanations
This neuron responds to speaker-label tokens or role tags (e.g. NAME_, user/assistant headers) that mark who is speaking.
Negative Logits
Token     Logit
ТО        -0.07
orthodox  -0.07
stk       -0.06
|array    -0.06
Bent      -0.06
ське      -0.06
enviar    -0.06
Лу        -0.06
seeker    -0.06
PROC      -0.06
Positive Logits
Token     Logit
amend     0.07
픽        0.06
fuck      0.06
Launch    0.06
Labels    0.06
elev      0.06
_until    0.06
�         0.06
ıda       0.06
Bird      0.06
Activations Density 0.048%