Explanations
words associated with stereotypes
references to stereotypes and their implications or effects
Negative Logits
sterdam    -0.79
ayan       -0.75
ateur      -0.74
ertodd     -0.72
rique      -0.67
inth       -0.66
ighters    -0.66
nesty      -0.65
sis        -0.65
ighth      -0.65
Positive Logits
stereotyp      1.01
stereotypes    0.91
stereotype     0.90
覚醒           0.78
clich          0.77
rities         0.76
depictions     0.75
portrayal      0.73
tropes         0.72
Breaker        0.71
Activations Density 0.019%