INDEX
Explanations
words and phrases related to discussing opinions and characteristics of groups of people
references to societal perceptions and stereotypes about various groups
Negative Logits
looph       -0.69
ework       -0.65
Stretch     -0.64
Torrent     -0.63
maneu       -0.63
vere        -0.63
withdrew    -0.63
defied      -0.60
Banner      -0.60
annis       -0.60
Positive Logits
invariably  1.39
usually     1.04
inevitably  0.99
cringe      0.98
often       0.93
usually     0.91
referring   0.90
typically   0.87
reply       0.86
rase        0.83
Activations Density 0.350%