Explanations
words related to gender
references to gender and gender identity
Negative Logits
ernels    -0.75
BLIC      -0.73
Grave     -0.71
Gerr      -0.68
esm       -0.67
edia      -0.67
enium     -0.67
ģĸ        -0.67
akings    -0.66
lege      -0.65
Positive Logits
dysph         1.29
equality      1.16
imbalance     1.10
pronouns      1.09
identity      1.05
stereotypes   1.04
Equality      1.01
stereotyp     0.99
flu           0.98
symmetry      0.98
Activations Density: 0.030%