Explanations
proper nouns and punctuation
Negative Logits
if        -1.45
ondas     -1.45
their     -1.42
which     -1.38
at        -1.34
(         -1.23
promote   -1.23
سبة       -1.16
when      -1.16
just      -1.13
Positive Logits
()">          1.13
(blank)       1.07
</em>         1.05
xxiv          1.05
dunque        1.04
magnificent   1.04
horrid        1.04
different     1.02
xxii          1.02
during        1.02
Activations Density: 0.001%