Explanations
the frequency of the pronoun 'I' in the text
Negative Logits
Token   Logit
c       -0.26
p       -0.25
e       -0.24
orem    -0.22
b       -0.21
v       -0.20
x       -0.19
a       -0.19
z       -0.19
h       -0.19
Positive Logits
Token   Logit
E       0.23
M       0.21
C       0.20
A       0.19
D       0.19
O       0.19
L       0.19
TRGL    0.18
P       0.18
N       0.18
Activations Density: 0.020%