INDEX
Explanations
references to societal roles and stereotypes
Negative Logits
μμ       -0.16
oller    -0.16
.echo    -0.15
GURL     -0.15
rana     -0.14
ÑĢава    -0.14
zw       -0.14
itung    -0.14
hlen     -0.14
cul      -0.14
Positive Logits
stereotype    0.28
stereotypes   0.25
stere         0.25
Ster          0.23
ster          0.18
pec           0.17
asso          0.16
stereo        0.15
pigeon        0.15
vil           0.15
Activations Density 0.028%