INDEX
Explanations
varied text sources
The neuron fires strongly on the assistant’s stock self-description “As an AI language model,” effectively detecting that exact self-referential phrase.
Negative Logits
transparent           -0.07
prefer                -0.07
chod                  -0.06
zbek                  -0.06
كات                   -0.06
بان                   -0.06
.layoutControlItem    -0.06
Translator            -0.06
pře                   -0.06
Express               -0.06
Positive Logits
Amendment             0.08
-range                0.07
внес                  0.07
прям                  0.06
mektedir              0.06
OBJECT                0.06
раниц                 0.06
memorable             0.06
ươ                    0.06
проблема              0.06
Activations Density 0.010%