INDEX
Explanations
references to the speaker or first-person perspectives
Negative Logits
cliffe        -0.19
nya           -0.18
rous          -0.17
stein         -0.17
so            -0.17
n             -0.16
themselves    -0.16
lass          -0.16
itself        -0.16
l             -0.16
Positive Logits
/us           0.38
SELF          0.23
/her          0.23
adows         0.22
asuring       0.20
zzo           0.19
andering      0.18
zelf          0.18
-même         0.18
ury           0.17
Activations Density: 0.077%