Explanations
stereotypes and related terms
references to stereotypes
Negative Logits
ayan          -0.79
inth          -0.74
sterdam       -0.73
rique         -0.71
ateur         -0.68
gan           -0.68
ighters       -0.67
ighth         -0.67
light         -0.66
packing       -0.65
Positive Logits
stereotyp     1.01
stereotype    0.90
stereotypes   0.89
rities        0.81
portrayal     0.80
depictions    0.80
portray       0.80
覚醒          0.80
caricature    0.73
clich         0.72
Activations Density: 0.018%