INDEX
Explanations
words related to gender
mentions of gender and related topics
Negative Logits
GOODMAN      -0.73
amina        -0.70
BLIC         -0.70
Provided     -0.68
Warrant      -0.68
steen        -0.67
Memor        -0.67
iries        -0.66
etsk         -0.66
ernels       -0.65
Positive Logits
dysph         1.02
endered       0.96
genders       0.93
gender        0.89
equality      0.89
fuck          0.87
bender        0.85
pronouns      0.84
imbalance     0.84
stereotypes   0.83
Activations Density: 0.016%