INDEX
Explanations
The neuron is primarily triggered by uppercase abbreviations and acronyms (multi‐letter all-caps tokens).
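As a rough illustration of the pattern described above, the following sketch flags tokens that consist only of two or more uppercase letters. The helper name `is_acronym_like` and the ASCII-only regular expression are assumptions made for this example; they are not part of the dashboard or of the auto-interp method, and the neuron's real behaviour may be broader or narrower than this heuristic.

```python
import re

# Hypothetical heuristic: two or more uppercase ASCII letters and nothing else.
ACRONYM_RE = re.compile(r"^[A-Z]{2,}$")


def is_acronym_like(token: str) -> bool:
    """Return True for multi-letter all-caps tokens, ignoring surrounding whitespace."""
    return bool(ACRONYM_RE.match(token.strip()))


# Example tokens (illustrative only, not taken from the neuron's activation data).
examples = ["NASA", " RA", "A", "ra", "Nasa"]
flags = {tok: is_acronym_like(tok) for tok in examples}
```

Under this heuristic a single capital such as "A" is not counted, since the explanation specifies multi-letter tokens.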
Negative Logits
Token       Logit
ing         -0.07
對          -0.07
—even       -0.07
황          -0.07
也不        -0.06
narrowed    -0.06
not         -0.06
від         -0.06
้จ          -0.06
well        -0.06
Positive Logits
Token       Logit
a           0.21
ra          0.18
ula         0.17
ka          0.17
ga          0.16
ha          0.16
RA          0.16
la          0.16
A           0.16
pa          0.16
Activations Density 0.779%