INDEX
Explanations
negative stereotypes and how they reinforce harmful narratives
Negative Logits
Token        Logit
gur          -0.86
sterdam      -0.85
ilated       -0.76
Aid          -0.75
Journals     -0.75
ayan         -0.75
cel          -0.73
imentary     -0.73
ates         -0.72
keeping      -0.72
Positive Logits
Token        Logit
覚醒         1.05
pmwiki       1.04
stereotyp    0.99
trope        0.92
tropes       0.89
clich        0.87
enegger      0.86
ALLY         0.80
rities       0.80
stereotypes  0.78
Activations Density: 6.969%