Explanations
The neuron detects the “Yes/No” answer‐option tokens (including the slash) in the consistency‐checking prompt.
Negative Logits
())) ↵      -0.07
ичес        -0.06
Qaeda       -0.06
enas        -0.06
-Qaeda      -0.06
ाइम         -0.06
script      -0.06
_isr        -0.06
(problem    -0.06
ानत         -0.06
Positive Logits
_specific   0.07
annually    0.07
гар         0.06
RPG         0.06
VK          0.06
responding  0.06
fries       0.06
invariably  0.06
gồm         0.06
om          0.06
Activations Density 0.002%