Explanations
references to gender, particularly focusing on male and female distinctions
Negative Logits
guys        -0.25
Guys        -0.22
boys        -0.20
men         -0.20
Boys        -0.18
Sisters     -0.18
guy         -0.18
ners        -0.17
males       -0.17
ladies      -0.16
Positive Logits
volent      0.41
-dominated  0.29
factor      0.29
fic         0.27
/f          0.26
-bodied     0.23
uada        0.23
vol         0.22
faction     0.22
bonding     0.21
Activations Density 0.020%