INDEX
Explanations
book titles/sections
The neuron activates on numeric tokens used as page or chapter numbers in tables of contents.
Negative Logits
denně    -0.07
дії      -0.07
ice      -0.07
kvinne   -0.07
рам      -0.06
ním      -0.06
нав      -0.06
gay      -0.06
(bin     -0.06
ैय       -0.06
Positive Logits
0.09
0.08
0.08
0.08
0.08
0.08
0.08
0.08
↵        0.08
0.07
Activations Density 0.007%
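
To put the density figure in perspective, here is a minimal sketch that assumes "activation density" means the fraction of tokens on which the neuron's activation is above zero. The page does not state that definition. The function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and chosen only for illustration. Under that assumption, 0.007% corresponds to roughly 7 active tokens per 100,000.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumed definition: the dashboard does not say how density is computed.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 7 active tokens out of 100,000 (not taken from the page).
acts = [0.0] * 99_993 + [1.5] * 7
density = activation_density(acts)
print(f"{density:.3%}")  # prints 0.007%
```

A density this low is consistent with a neuron that responds only to a narrow context, such as the page and chapter numbers described in the explanation.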