Explanations
character descriptions
This neuron activates on internal metadata tokens (e.g. header or control code markers like `<|start_header_id|>`) rather than actual content words.
Negative Logits
Field          -0.07
Honda          -0.06
hape           -0.06
_birth         -0.06
-mean          -0.06
_lit           -0.06
prosecutors    -0.06
Bos            -0.06
strdup         -0.06
不安           -0.06
Positive Logits
swirl          0.07
علاق           0.07
jadx           0.06
特色           0.06
accomplished   0.06
classe         0.06
ýn             0.06
-inspired      0.06
책             0.05
denying        0.05
Activation density: 0.024%
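Dashboards like this one commonly build the logit lists above by projecting the neuron's output weight vector onto the model's unembedding matrix and then sorting the per-token scores. They commonly define activation density as the share of tokens on which the neuron fires. The sketch below assumes those definitions. It is not this page's actual code. The function names, the zero activation threshold, and the toy numbers are hypothetical.

```python
def direct_logit_effects(direction, unembed, vocab, k=10):
    """Score every vocabulary token by the dot product of a neuron's output
    direction with that token's unembedding column (hypothetical helper).

    direction: list of d_model floats (the neuron's output weights)
    unembed:   d_model x vocab_size matrix, stored as a list of rows
    vocab:     list of token strings, one per unembedding column
    """
    d_model = len(direction)
    scores = []
    for j, token in enumerate(vocab):
        score = sum(direction[i] * unembed[i][j] for i in range(d_model))
        scores.append((token, score))
    ranked = sorted(scores, key=lambda pair: pair[1])  # ascending by score
    negative = ranked[:k]        # tokens whose logits the neuron pushes down most
    positive = ranked[::-1][:k]  # tokens whose logits the neuron pushes up most
    return negative, positive


def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds `threshold` (assumed 0)."""
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy example with a 2-dimensional residual stream and a 4-token vocabulary.
neg, pos = direct_logit_effects(
    [1.0, -0.5],
    [[0.2, -0.1, 0.0, 0.3],
     [0.1, 0.4, -0.2, 0.0]],
    ["a", "b", "c", "d"],
    k=2,
)
print([t for t, _ in neg], [t for t, _ in pos])
print(activation_density([0.0, 0.5, 0.0, 0.0]))
```

On this scale, a density of 0.024% would correspond to the neuron firing on roughly 24 of every 100,000 tokens. That is consistent with it responding only to rare metadata or control tokens.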