INDEX
Explanations
phrases related to social perceptions and stereotypes
Negative Logits
vine          -0.79
alde          -0.79
ulo           -0.75
orst          -0.71
imentary      -0.70
hement        -0.70
depended      -0.68
bis           -0.68
fam           -0.67
interrupted   -0.67
Positive Logits
masculinity     0.85
criminality     0.77
how             0.74
reality         0.73
sexuality       0.71
life            0.71
events          0.71
perfection      0.69
homosexuality   0.69
morality        0.68
Activations Density: 0.082%