Explanations
references to personal experiences and self-identity
Negative Logits
itself        -0.25
ness          -0.21
themselves    -0.20
ly            -0.19
wers          -0.18
ting          -0.17
rette         -0.16
appen         -0.16
ship          -0.16
nya           -0.16
Positive Logits
/us           0.63
/her          0.43
personally    0.33
/my           0.30
zelf          0.28
-même         0.28
adows         0.27
adow          0.27
SELF          0.27
andering      0.26
Activations Density: 0.248%