INDEX
Explanations
mentions of a particular female character
mentions of the pronoun 'her'
Negative Logits
ype           -0.93
ypes          -0.90
kefeller      -0.90
ornia         -0.89
arios         -0.88
VERTIS        -0.76
brush         -0.70
eers          -0.70
ollo          -0.70
ablishment    -0.69
Positive Logits
ding           1.26
own            1.26
daughter       1.13
metic          1.12
husband        1.12
cule           1.09
granddaughter  1.06
nia            1.04
ded            1.02
Majesty        1.02
Activations Density 0.113%