Explanations
references to female characters or groups in various contexts
Negative Logits
Token       Logit
eer         -0.23
male        -0.18
ally        -0.17
male        -0.17
males       -0.16
gentlemen   -0.16
raman       -0.16
Male        -0.16
ality       -0.15
lid         -0.15
Positive Logits
Token       Logit
hood        0.29
friend      0.23
friends     0.22
/man        0.22
/w          0.22
riend       0.20
ies         0.19
ie          0.19
enger       0.18
teenth      0.18
Activations Density: 0.031%