INDEX
Explanations
keywords related to gender identity and gender-related topics
references to gender in various contexts
Negative Logits
Gerr        -0.73
Lent        -0.71
Gi          -0.70
ernels      -0.67
Prosecut    -0.66
WT          -0.66
Mub         -0.64
BLIC        -0.64
Grave       -0.64
amina       -0.64
Positive Logits
dysph        1.17
equality     1.00
pronouns     0.96
Equality     0.96
identity     0.90
stereotypes  0.89
imbalance    0.85
flu          0.85
bending      0.85
bender       0.83
Activations Density: 0.028%