Explanations
references to gender, specifically male and female
Negative Logits
Notion     -0.77
ap         -0.70
?>">       -0.70
Conquer    -0.65
"]         -0.65
}}"></     -0.64
ed         -0.64
YAP        -0.64
man        -0.64
dill       -0.63
Positive Logits
male       1.91
Male       1.86
MALE       1.80
female     1.78
FEMALE     1.75
Female     1.71
Male       1.71
MALE       1.69
Female     1.65
female     1.62
Activations Density 0.087%