INDEX
Explanations
pronouns referring to ourselves
references to self-identity and self-perception
Negative Logits
Token     Logit
orie      -0.75
onna      -0.70
Sierra    -0.65
lets      -0.65
cemic     -0.64
ros       -0.63
heny      -0.63
emis      -0.63
Klu       -0.62
mer       -0.62
Positive Logits
Token      Logit
selves     1.61
ourselves  1.25
tremend    1.01
self       0.98
selves     0.96
exting     0.95
proport    0.84
eleph      0.82
perspect   0.81
exha       0.81
Activations Density: 0.006%