INDEX
Explanations
"the" followed by specific nouns
tokens that never activate — an effectively inactive neuron.
Negative Logits
\]       0.26
T        0.25
;        0.24
2        0.22
.]       0.21
।        0.21
I        0.21
^{*}     0.21
𝗔        0.20
1        0.20
Positive Logits
to         0.30
algunos    0.23
soldats    0.22
kprop      0.22
exemplu    0.21
of         0.21
bardziej   0.21
dược       0.21
актриса    0.21
at         0.21
Activations Density 0.012%