Explanations
phrases related to female individuals
references to a specific individual, focusing on their actions and attributes
Negative Logits
Token       Logit
ype         -0.91
ypes        -0.91
kefeller    -0.87
ornia       -0.86
inctions    -0.70
brush       -0.69
arios       -0.69
VERTIS      -0.67
ENSE        -0.66
ozy         -0.66
Positive Logits
Token       Logit
husband     1.35
ding        1.34
metic       1.23
cule        1.19
herself     1.16
daughter    1.08
ded         1.07
nia         1.05
boyfriend   1.03
Majesty     1.01
Activations Density: 0.118%