Explanations
references to self-affirmation and personal identity
Negative Logits
Token       Logit
their       -0.74
leurs       -0.68
ihre        -0.64
glected     -0.63
ihrer       -0.61
在我的      -0.58
Roskov      -0.54
Their       -0.52
Diwedd      -0.52
cherchés    -0.52
Positive Logits
Token       Logit
yourself    2.44
Yourself    2.04
YOURSELF    1.83
yourself    1.79
Yourself    1.54
thyself     1.47
yourselves  1.38
oneself     1.15
himſelf     1.07
itſelf      1.02
Activations Density: 0.076%