INDEX
Explanations
instances of the pronoun "I" and related self-referential phrases
Negative Logits
ord       -0.15
uchs      -0.14
sworn     -0.14
swore     -0.14
ific      -0.14
were      -0.13
tatus     -0.13
ake       -0.13
were      -0.13
embros    -0.13
Positive Logits
ronic       0.23
mean        0.21
Mean        0.18
OW          0.17
Mean        0.17
HONE        0.17
therefore   0.17
certainly   0.16
wish        0.16
wonder      0.16
Activations Density 0.262%