Explanations
References to gender, focusing on male identities and issues related to masculinity.
Negative Logits
Token      Logit
guys       -0.23
Guys       -0.21
boys       -0.19
men        -0.19
ners       -0.18
Boys       -0.17
edly       -0.15
rei        -0.15
guy        -0.15
hombres    -0.15
Positive Logits
Token         Logit
volent        0.45
-dominated    0.36
fic           0.33
/f            0.28
factor        0.28
-bodied       0.28
-only         0.28
-centric      0.26
-led          0.26
vol           0.25
Activations Density 0.021%
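
The density figure is commonly read as the fraction of sampled tokens on which the feature fires at all. The sketch below illustrates that reading only. It assumes an activation counts as "firing" when it is above zero, and the function name, threshold, and sample data are hypothetical rather than taken from the dashboard itself.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: a token "activates" the feature when its activation value
    is strictly greater than the threshold (0.0 by default).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 10,000 tokens, only 2 of which activate the feature.
sample = [0.0] * 9998 + [1.7, 0.4]
density = activation_density(sample)

# Format as a percentage with three decimals, the same style as the figure above.
print(f"{density:.3%}")
```

Under this reading, a density of 0.021% would mean the feature fires on roughly 2 out of every 10,000 tokens, which is typical of a sparse feature.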