Explanations
The neuron detects the "model" (assistant) speaker token—i.e., the start of model/assistant responses.
Negative Logits
orum  0.42
інформа  0.42
нашей  0.42
ще  0.42
да  0.42
нных  0.42
paraphr  0.42
пита  0.41
orifice  0.41
горе  0.41
Positive Logits
your  0.46
secretly  0.44
hopelessly  0.43
ඔබේ  0.43
fascist  0.42
இரவு  0.42
secret  0.42
blazing  0.42
YOUR  0.41
Your  0.41
Activations Density 0.057%