Explanations
possessive pronouns and their associations
Negative Logits
himself    -0.27
his        -0.22
його       -0.20
his        -0.19
Himself    -0.17
его        -0.17
His        -0.16
His        -0.16
jeho       -0.16
jego       -0.16
Positive Logits
yourself      0.33
yourselves    0.26
SELF          0.26
own           0.23
ths           0.22
’e            0.22
your          0.21
’re           0.20
your          0.20
nger          0.20
Activations Density: 0.218%