INDEX
Explanations
references to women or feminine pronouns
mentions of the pronoun "her" in various contexts
Negative Logits
ured        -0.61
bluff       -0.60
curfew      -0.59
fty         -0.58
elta        -0.57
govtrack    -0.57
prints      -0.55
Cage        -0.55
prints      -0.54
sock        -0.54
Positive Logits
itage       1.22
tz          1.19
ding        1.07
ald         0.97
pes         0.96
itance      0.90
jee         0.89
mite        0.88
lich        0.88
mone        0.87
Activations Density: 0.046%