INDEX
Explanations
adjectives related to gender characteristics
terms related to gender identity and expressions of femininity and masculinity
Negative Logits
Assembly      -0.82
oard          -0.81
undo          -0.79
RAY           -0.72
Redemption    -0.71
oulos         -0.70
Grant         -0.66
Stone         -0.64
owitz         -0.63
rave          -0.62
Positive Logits
istries       0.89
inity         0.86
feminine      0.78
masculine     0.77
fem           0.77
女            0.77
hygiene       0.75
xual          0.74
pronouns      0.72
inant         0.68
Activations Density: 0.044%