Explanations
proper nouns, particularly names of individuals and titles
Negative Logits
Token          Logit
in             -0.18
D              -0.17
T              -0.16
H              -0.16
C              -0.16
B              -0.16
(not shown)    -0.15
from           -0.15
with           -0.15
to             -0.15
Positive Logits
Token    Logit
in       0.19
ar       0.17
any      0.17
an       0.17
on       0.17
ina      0.17
it       0.17
us       0.16
im       0.16
is       0.16
Activations Density 0.510%