INDEX
Explanations
references to personal pronouns and possessive adjectives
Negative Logits
ington        -0.16
possibility   -0.15
ered          -0.15
ness          -0.14
ering         -0.14
iosa          -0.14
ord           -0.14
itest         -0.14
lifestyles    -0.13
(             -0.13
Positive Logits
own           0.43
/her          0.30
SELF          0.26
próp          0.25
Own           0.24
Own           0.24
self          0.24
own           0.24
_own          0.23
sel           0.22
Activations Density 0.965%