Explanations
references to female characters or titles related to women
Negative Logits
en              -0.75
al              -0.73
u               -0.71
Warszawie       -0.66
Wilmington      -0.66
Roswell         -0.65
prostitutes     -0.63
Exxon           -0.62
rito            -0.62
encephalitis    -0.62
Positive Logits
LADY            1.28
Lady            1.23
LADY            1.12
Lady            1.09
lady            1.08
Ladybug         0.98
lady            0.96
Ladies          0.85
ladybug         0.85
ladies          0.83
Activations Density 0.006%