INDEX
Explanations
punctuation and code
This neuron detects tokens involved in defining or assigning the assistant’s persona or role (e.g., “NAME_1”, “author”, and similar meta-instruction placeholders).
Negative Logits
zag      -0.07
mnop     -0.06
サー     -0.06
-but     -0.06
하지만   -0.06
alah     -0.06
ilos     -0.06
kan      -0.06
анные    -0.06
�        -0.06
Positive Logits
pineapple    0.08
incidence    0.07
Braun        0.06
BEGIN        0.06
Information  0.06
charity      0.06
calorie      0.06
ategorized   0.06
ček          0.06
univerz      0.06
Activations Density 0.001%