Explanations
mentions of the pronoun 'her' in various contexts
Negative Logits
s        -0.18
sw       -0.17
eum      -0.16
rans     -0.15
lass     -0.15
e        -0.15
(-       -0.15
swap     -0.15
eg       -0.15
ymoon    -0.14
Positive Logits
editary  0.29
/us      0.26
/her     0.25
own      0.24
ding     0.23
zelf     0.21
esy      0.20
ewith    0.19
SELF     0.19
-même    0.19
Activations Density: 0.132%