Explanations
occurrences of the pronoun "I."
Negative Logits
Token        Logit
gnore        -0.33
l            -0.33
mp           -0.30
K            -0.29
SS           -0.28
lluminate    -0.27
rish         -0.26
ron          -0.25
ch           -0.25
t            -0.25
Positive Logits
Token        Logit
HM           0.16
HF           0.15
llum         0.15
HK           0.15
reland       0.15
bid          0.15
YK           0.14
brahim       0.14
rvine        0.14
PL           0.14
Activations Density 0.081%