INDEX
Explanations
the presence of articles or quantifiers in various contexts
Negative Logits
ories      -0.17
222        -0.16
lights     -0.15
agal       -0.15
th         -0.15
Bitte      -0.15
illes      -0.14
kır        -0.14
erville    -0.14
ãĥ¼ãĥĬ     -0.14
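The token ãĥ¼ãĥĬ in the negative-logit list looks like a byte-level BPE vocabulary string rather than readable text. The sketch below assumes the model uses the GPT-2 style byte-to-unicode mapping (the dashboard does not say so) and maps each displayed character back to its raw byte before decoding as UTF-8. The helper name decode_token is hypothetical.

```python
def bytes_to_unicode():
    """Rebuild the GPT-2 style byte -> printable-character table (assumed tokenizer)."""
    # Bytes that are already printable map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte is shifted to a code point at 256 and above.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


def decode_token(token: str) -> str:
    """Hypothetical helper: turn a displayed vocabulary string back into text."""
    char_to_byte = {ch: b for b, ch in bytes_to_unicode().items()}
    raw = bytes(char_to_byte[ch] for ch in token)
    return raw.decode("utf-8", errors="replace")


print(decode_token("ãĥ¼ãĥĬ"))  # prints ーナ
```

Under that assumption the entry is the katakana fragment ーナ, so it is a legitimate non-Latin token and not encoding damage in the table.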
Positive Logits
dozen      0.27
hundred    0.22
thousand   0.18
undred     0.17
decade     0.16
ught       0.15
Vog        0.14
Dek        0.14
century    0.14
doz        0.14
Activations Density 0.049%