INDEX
Explanations
references to personal experiences and self-referential statements
Negative Logits
Token        Logit
ness         -0.23
themselves   -0.23
itself       -0.20
nya          -0.20
ly           -0.19
ship         -0.18
naire        -0.18
Ùĩا          -0.17
weise        -0.17
wise         -0.17
Positive Logits
Token        Logit
/us          0.58
/her         0.34
/my          0.29
adows        0.29
zzo          0.28
personally   0.28
SELF         0.28
adow         0.28
andering     0.25
-même        0.25
Activations Density 0.117%
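
The density figure above can be read as the share of sampled tokens on which the feature fires. The sketch below illustrates that reading only; it assumes density means "fraction of tokens with activation above a threshold", and the function name `activation_density`, the `threshold` parameter, and the sample values are hypothetical rather than taken from the dashboard.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: density is defined as (tokens where the feature fires) / (all tokens).
    The dashboard's exact definition and sampling procedure are not given in the source.
    """
    if len(activations) == 0:
        raise ValueError("activations must contain at least one value")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical sample: the feature fires on 1 of 4 tokens.
sample = [0.0, 0.0, 1.5, 0.0]
print(f"{activation_density(sample):.3%}")
```

Under this reading, a density of 0.117% would correspond to the feature firing on roughly one token in every 850 sampled.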