Explanations
Code and text snippets
The neuron fires on tokens from the policy/instruction header (e.g., words such as "history," "insult," "competitive," and "innuendos"); that is, it detects system-level instruction or policy text rather than user content.
Negative Logits
harvest    -0.07
�    -0.06
İN    -0.06
міністра    -0.06
лиц    -0.06
закін    -0.06
以上    -0.06
轮    -0.06
HAVE    -0.06
٥    -0.06
Positive Logits
.mouse    0.07
看到    0.06
Mec    0.06
twisting    0.06
rv    0.06
.Hour    0.06
Sebastian    0.06
entrev    0.06
прав    0.06
cott    0.06
Activations Density 0.024%