INDEX
Explanations
concepts related to gender
terms related to gender identity and gender disparities
NEGATIVE LOGITS
Token          Logit
BLIC           -0.75
iries          -0.72
Gi             -0.71
ernels         -0.71
ournal         -0.68
Mub            -0.68
Interstitial   -0.68
Warrant        -0.67
amina          -0.67
Gerr           -0.67
POSITIVE LOGITS
Token          Logit
dysph           1.06
Equality        0.95
pronouns        0.93
equality        0.92
stereotypes     0.88
bender          0.86
identity        0.85
bent            0.84
bending         0.84
fuck            0.83
Activations Density 0.030%