INDEX
Explanations
mentions of men and gender-related terms
Negative Logits
Token         Logit
')")          -0.95
']")          -0.95
]')           -0.86
—             -0.86
الإنجليزية    -0.84
ligiloj       -0.83
]]            -0.82
"");          -0.81
$")           -0.80
).)           -0.78
Positive Logits
Token         Logit
men           3.37
Men           3.15
Men           3.01
MEN           2.79
men           2.59
MEN           2.31
hommes        1.92
Männer        1.86
hombres       1.85
mens          1.76
Activations Density: 0.064%
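
As a rough illustration of the density figure, the sketch below assumes that density means the fraction of token positions on which the feature fires at all (activation above zero). The page itself does not state that definition. The function name `activation_density` and the toy data are hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" counts positions with activation > threshold;
    the dashboard's exact definition is not given in the source.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: 100,000 token positions, 64 of which are active.
toy = [0.0] * 99_936 + [1.5] * 64
density = activation_density(toy)
print(f"{density:.3%}")  # prints 0.064%
```

Under that assumption, a density of 0.064% corresponds to roughly 64 active positions per 100,000 tokens.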