INDEX
Explanations
the word "He" appearing in the text
mentions of a particular individual named "He" or a similar pronoun
Negative Logits
Thoughts        -0.67
Manip           -0.65
reality         -0.60
Disclosure      -0.59
Cumm            -0.57
Nichols         -0.57
Gems            -0.56
legality        -0.56
Appropriations  -0.55
anonymously     -0.55
Positive Logits
arer    1.19
eded    1.17
arers   1.15
lling   1.11
eding   1.09
lder    1.01
arth    0.99
ather   0.99
ALTH    0.98
pton    0.97
Activations Density 0.096%