INDEX
Explanations
mentions or descriptions of male individuals
references to males in various contexts
Negative Logits
Tokens        -0.81
roll          -0.81
akings        -0.77
ateg          -0.73
planes        -0.72
Annotations   -0.72
Market        -0.71
eries         -0.70
Yards         -0.69
wal           -0.68
Positive Logits
male          3.56
female        2.92
males         2.87
Male          2.75
male          2.65
Male          2.53
Female        2.47
female        2.44
females       2.30
Female        2.27
Activations Density 0.018%