Explanations
code and documentation
The neuron fires on instructional or meta-prompt language, especially negation cues such as “not” and related instructional terms that signal prohibitions.
Negative Logits
bais            -0.07
Z               -0.07
_third          -0.07
z               -0.06
interpolated    -0.06
�               -0.06
الك             -0.06
研              -0.06
Repeat          -0.06
.magic          -0.06
Positive Logits
없습니다        0.06
.ย              0.06
теб             0.06
TokenType       0.06
vro             0.06
нес             0.06
unleashed       0.06
CKER            0.06
Dund            0.06
ँ               0.06
Activations Density 0.016%
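
The page does not say how the density figure is defined. One common reading is the share of sampled tokens on which the neuron's activation is above zero. The sketch below assumes that definition. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and are not taken from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the fraction of sampled tokens on which
    the neuron is active. The dashboard may use a different definition.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical sample: 12,500 tokens, of which 2 activate the neuron.
sample = [0.0] * 12498 + [1.3, 0.4]
print(f"{activation_density(sample):.3f}%")  # prints 0.016%
```

Under this reading, a density of 0.016% would mean the neuron is active on roughly 1 in 6,250 tokens, which fits a narrowly selective feature.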