INDEX
Explanations
personal pronouns and reflexive pronouns
references to the concept of self or personal identity
Negative Logits
Token      Logit
Amend      -0.67
Sierra     -0.62
airspace   -0.61
Clover     -0.61
racket     -0.59
Vil        -0.59
akings     -0.59
Noon       -0.58
Scrib      -0.58
allas      -0.58
Positive Logits
Token        Logit
selves       1.20
destruct     0.92
ortium       0.91
pecially     0.88
ridges       0.87
self         0.86
theless      0.78
terday       0.77
explanatory  0.76
contained    0.76
Activations Density: 0.019%