INDEX
Explanations
mentions of a specific female individual
Negative Logits
kefeller    -0.85
ype         -0.83
ypes        -0.81
vernment    -0.74
undo        -0.73
ornia       -0.72
antage      -0.66
ustom       -0.65
hovah       -0.64
redo        -0.64
Positive Logits
pher        1.38
herself     1.37
athed       1.20
husband     1.13
ding        1.13
athing      1.11
pard        1.11
metic       1.04
cule        1.04
vagina      0.96
Activations Density: 0.533%