INDEX
Explanations
references to stereotypes and biases
Negative Logits
ico      -0.18
elan     -0.17
iero     -0.17
sik      -0.17
rai      -0.17
ayo      -0.15
Um       -0.15
eco      -0.15
idine    -0.15
nels     -0.15
Positive Logits
embr     0.16
598      0.16
         0.14
rez      0.14
snap     0.14
zos      0.14
.snap    0.13
acus     0.13
aws      0.13
con      0.13
Activations Density 0.145%