INDEX
Explanations
references to gender, particularly males and females
Negative Logits
Token        Logit
Notion       -0.81
Inscription  -0.69
kasarigan    -0.68
?>">         -0.67
}}"></       -0.67
Krise        -0.66
dill         -0.66
شهاد         -0.65
obatan       -0.64
hut          -0.63
Positive Logits
Token        Logit
Male         1.76
male         1.72
MALE         1.66
Male         1.59
MALE         1.59
FEMALE       1.52
males        1.48
female       1.48
male         1.47
Female       1.45
Activations Density: 0.100%