Explanations
expressions of personal identity and self-reference
Negative Logits
ourselves        -0.79
our              -0.72
we               -0.65
Our              -0.63
OUR              -0.63
nossos ("our")   -0.59
their            -0.59
bizi ("us")      -0.57
Our              -0.57
我们的 ("our")    -0.55
Positive Logits
myself           0.80
EndContext       0.76
I                0.75
my               0.74
Tôi ("I")        0.73
moje ("my")      0.72
myself           0.71
讓我 ("let me")   0.70
私の ("my")       0.70
אני ("I")        0.69
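Dashboards of this kind commonly derive the positive and negative logits by projecting the feature's decoder direction through the model's unembedding matrix and reading off the highest- and lowest-scoring tokens. The sketch below assumes that convention. The helper `logit_effects`, the names `w_dec` and `W_U`, and the toy weights are all hypothetical stand-ins, not the dashboard's actual code or this model's weights.

```python
from typing import List, Sequence, Tuple


def logit_effects(
    w_dec: Sequence[float],
    W_U: Sequence[Sequence[float]],
    vocab: Sequence[str],
    k: int = 10,
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Hypothetical helper: project a feature's decoder vector through the unembedding.

    w_dec has length d_model; W_U is a d_model x d_vocab matrix (rows are
    residual-stream dimensions, columns are vocabulary entries); vocab holds
    d_vocab token strings. Returns the k most positive and k most negative
    (token, logit) pairs.
    """
    d_vocab = len(vocab)
    # Direct logit effect on each token: dot product of w_dec with column v of W_U.
    logits = [
        sum(w_dec[d] * W_U[d][v] for d in range(len(w_dec)))
        for v in range(d_vocab)
    ]
    # Token indices sorted by logit, ascending.
    order = sorted(range(d_vocab), key=lambda v: logits[v])
    top_positive = [(vocab[v], logits[v]) for v in order[::-1][:k]]
    top_negative = [(vocab[v], logits[v]) for v in order[:k]]
    return top_positive, top_negative


# Toy example (assumed values): a 2-dimensional residual stream, 4-token vocabulary.
vocab = ["I", "we", "my", "our"]
W_U = [[1.0, -1.0, 0.5, -0.5],
       [0.0, 0.0, 0.0, 0.0]]
pos, neg = logit_effects([0.8, 0.3], W_U, vocab, k=2)
print(pos)
print(neg)
```

In the toy run the decoder vector pushes "I" and "my" up and "we" and "our" down, the same singular-versus-plural split visible in the lists above.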
Activation Density: 0.125% (1 in every 800 tokens)
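Activation density is usually reported as the share of sampled token positions on which the feature fires, so 0.125% works out to one position in 800. A minimal sketch, assuming a flat list of per-token activations and a firing threshold of 0.0. Both the threshold and the helper name `activation_density` are assumptions for illustration.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Hypothetical helper: fraction of token positions where the feature fires.

    A position counts as firing when its activation strictly exceeds
    `threshold` (0.0 by default, an assumed convention).
    """
    if len(activations) == 0:
        return 0.0
    firing = sum(1 for a in activations if a > threshold)
    return firing / len(activations)


# One firing position out of 800 gives a density of 0.00125, i.e. 0.125%.
acts = [0.0] * 799 + [2.5]
print(f"{activation_density(acts):.3%}")
```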