Explanations
instances of gender representation and biases in various contexts
Negative Logits
Token     Logit
ikh       -0.17
anse      -0.16
oron      -0.16
utut      -0.15
iverz     -0.15
regn      -0.15
تون       -0.14
lover     -0.14
inton     -0.14
ushman    -0.14
Positive Logits
Token     Logit
female    0.48
male      0.44
females   0.41
Female    0.39
gender    0.38
male      0.37
female    0.36
males     0.36
女性      0.35
women     0.35
Activations Density: 0.158%
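
For readers unfamiliar with the density figure, the following is a minimal sketch of how such a number could be computed, assuming density means the share of sampled tokens on which the feature's activation is above zero. The function name `activation_density`, the `threshold` parameter, and the toy activations are hypothetical and do not come from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" is the fraction of sampled tokens on which the
    feature fires at all; the dashboard's exact definition may differ.
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return 100.0 * fired / len(activations)


# Toy activations, made up for illustration: 2 of 8 tokens fire.
toy = [0.0, 0.0, 1.3, 0.0, 0.0, 0.0, 0.7, 0.0]
print(f"{activation_density(toy):.3f}%")  # prints 25.000%
```

Under this reading, a density of 0.158% would correspond to the feature firing on roughly 158 of every 100,000 sampled tokens.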