Explanations
gender stereotypes
references to stereotypes and their impacts
Negative Logits
Token        Logit
hner         -0.75
inth         -0.74
ayan         -0.73
mits         -0.71
ositories    -0.69
kos          -0.68
metic        -0.68
endez        -0.67
gan          -0.66
mission      -0.65
Positive Logits
Token           Logit
stereotype      1.22
stereotypes     1.16
stereotyp       1.15
rities          0.94
caricature      0.84
prejudice       0.83
stereotypical   0.82
clich           0.81
caric           0.79
覚醒            0.76
Activations Density: 0.013%