INDEX
Explanations
references to gender-related topics in various contexts
references to gender and gender-related topics
Negative Logits
Token     Logit
ernels    -0.77
BLIC      -0.73
WT        -0.72
Lent      -0.68
Gerr      -0.68
Gi        -0.67
Warrant   -0.67
edia      -0.65
iths      -0.65
Grave     -0.65
Positive Logits
Token            Logit
dysph            1.22
equality         1.07
imbalance        0.98
pronouns         0.97
identity         0.96
Equality         0.96
stereotypes      0.94
flu              0.91
discrimination   0.91
inequality       0.90
Activations Density: 0.046%
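
The page does not say how the density figure is computed. One common reading is the fraction of sampled token positions on which the feature has a nonzero activation. The sketch below illustrates that reading only; the function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of active positions, which is one
    plausible interpretation and not a definition confirmed by the source.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 46 active positions out of 100,000.
sample = [1.5] * 46 + [0.0] * (100_000 - 46)
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.046% would mean the feature is active on roughly 46 of every 100,000 tokens, which fits a feature narrowly tied to gender-related text.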