Explanations
mentions of gender-related topics
references to gender-related topics and issues
Negative Logits
Token         Logit
BLIC          -0.77
amina         -0.71
ernels        -0.70
Warrant       -0.69
Showtime      -0.67
Gi            -0.66
iries         -0.66
Mub           -0.65
steen         -0.65
Interstitial  -0.65
Positive Logits
Token         Logit
dysph         1.17
pronouns      0.95
equality      0.90
Equality      0.88
bender        0.87
identity      0.87
fuck          0.87
stereotypes   0.86
imbalance     0.85
endered       0.84
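The page does not say how the two logit lists were produced. A common approach for sparse-autoencoder features is to project the feature's decoder direction through the model's unembedding matrix and keep the k lowest- and highest-scoring tokens. The sketch below assumes that method. The helper name `top_logit_tokens`, its parameters, and the toy data are hypothetical, and the sketch ignores any final LayerNorm.

```python
import numpy as np

def top_logit_tokens(decoder_direction, unembed, vocab, k=10):
    """Hypothetical helper: rank vocabulary tokens by the direct logit effect
    of a feature's decoder direction (assumes a purely linear unembedding)."""
    # decoder_direction: (d_model,), unembed: (d_model, vocab_size)
    scores = decoder_direction @ unembed  # one score per vocabulary token
    order = np.argsort(scores)            # ascending: most negative first
    negative = [(vocab[i], float(scores[i])) for i in order[:k]]
    positive = [(vocab[i], float(scores[i])) for i in order[::-1][:k]]
    return negative, positive

# Toy example with a 2-dimensional model and a 4-token vocabulary.
vocab = ["a", "b", "c", "d"]
unembed = np.array([[0.5, -1.0, 2.0, 0.0],
                    [1.0,  1.0, 1.0, 1.0]])
direction = np.array([1.0, 0.0])
neg, pos = top_logit_tokens(direction, unembed, vocab, k=2)
print(neg)  # [('b', -1.0), ('d', 0.0)]
print(pos)  # [('c', 2.0), ('a', 0.5)]
```

Under this assumption, tokens such as "dysph" and "pronouns" sit at the top because the feature's direction aligns most strongly with their unembedding vectors.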
Activation density: 0.022%
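Activation density is normally read as the share of token positions on which the feature is active. The sketch below assumes that definition with a threshold of zero, since the page does not state its threshold or the size of the sample it used. `activation_density` is a hypothetical helper, and the 100,000-token array is illustrative data only.

```python
import numpy as np

def activation_density(activations, threshold=0.0):
    """Hypothetical helper: percentage of token positions where the
    feature's activation exceeds `threshold`."""
    activations = np.asarray(activations)
    active = np.count_nonzero(activations > threshold)
    return 100.0 * active / activations.size

# Illustrative data: the feature fires on 22 of 100,000 tokens.
acts = np.zeros(100_000)
acts[:22] = 1.0
print(activation_density(acts))  # about 0.022
```

At 0.022%, the feature fires on roughly 22 of every 100,000 tokens. That is very sparse, which is consistent with a narrow topical feature like the one described in the explanations.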