INDEX
Explanations
references to gender inequalities and societal expectations
Negative Logits
icari    -0.10
arget    -0.09
indir    -0.08
รม       -0.08
aris     -0.08
allah    -0.08
もり     -0.08
aç       -0.08
onse     -0.08
anzi     -0.07
Positive Logits
male        0.24
males       0.20
Male        0.18
male        0.18
男性        0.16
masculine   0.16
Male        0.15
men         0.14
мужчин      0.14
mascul      0.14
Activations Density 0.032%