INDEX
Explanations
words related to beliefs, stereotypes, and societal issues
beliefs or stereotypes surrounding masculinity and gender roles
Negative Logits
sidx         -0.74
cussion      -0.70
ovember      -0.69
cember       -0.68
arthed       -0.67
laughs       -0.67
Multiple     -0.67
mentioned    -0.67
adel         -0.66
ftime        -0.65
Positive Logits
somehow      1.16
magically    1.11
infall       1.10
immutable    1.09
superiority  1.03
invincible   1.00
innate       1.00
inherently   0.99
virtuous     0.96
benevolent   0.94
Activations Density 0.737%