INDEX
Explanations
actions performed by himself
NEGATIVE LOGITS
your	0.64
yourselves	0.63
your	0.61
Your	0.57
yourself	0.56
him	0.55
lui	0.53
iyong	0.53
彼	0.53
yourself	0.52
POSITIVE LOGITS
himself	1.48
తన	1.16
തന്റെ	1.13
نفسه	1.00
his	0.97
Himself	0.93
자신의	0.92
தனது	0.92
ತನ್ನ	0.86
своему	0.83
Activations Density: 0.015%