INDEX
Explanations
words related to social injustice and marginalization
Negative Logits
šem        -0.17
unsch      -0.16
uzzi       -0.16
ceptors    -0.14
urally     -0.14
lical      -0.14
amed       -0.14
sto        -0.14
ddy        -0.13
δÏģο       -0.13
Positive Logits
/dis        0.22
(dis        0.19
dis         0.19
Dis         0.18
Dis         0.18
Ĥæķ°        0.17
zell        0.17
zung        0.17
.dis        0.17
-dis        0.16
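The two logit lists rank vocabulary tokens by how strongly this feature pushes their output logits down (negative) or up (positive). Below is a minimal sketch of one common way such lists are computed. It assumes that each token's score is the dot product of the feature's decoder direction with that token's unembedding column. Some dashboards also fold in the final LayerNorm, and that step is omitted here. The function name `top_logit_effects`, the toy weights, and the toy token names are all hypothetical.

```python
import numpy as np

def top_logit_effects(w_dec_f, w_u, vocab, k=10):
    """Rank tokens by the direct effect of one feature on the output logits.

    Assumption (hypothetical helper): score = w_dec_f @ w_u, with
    w_dec_f of shape (d_model,) and w_u of shape (d_model, vocab_size).
    Returns the k highest-scoring and k lowest-scoring (token, score) pairs.
    """
    effects = w_dec_f @ w_u            # one score per vocabulary token
    order = np.argsort(effects)        # indices sorted by ascending score
    negative = [(vocab[i], float(effects[i])) for i in order[:k]]
    positive = [(vocab[i], float(effects[i])) for i in order[::-1][:k]]
    return positive, negative

# Toy example: a 4-dimensional residual stream and a 5-token vocabulary.
vocab = ["tok_a", "tok_b", "tok_c", "tok_d", "tok_e"]
w_dec_f = np.array([1.0, 0.0, 0.0, 0.0])   # feature direction (toy)
w_u = np.zeros((4, 5))
w_u[0] = [0.22, 0.19, -0.17, -0.16, 0.0]   # only the first dimension matters here
positive, negative = top_logit_effects(w_dec_f, w_u, vocab, k=2)
print(positive)
print(negative)
```

Sorting once with `argsort` yields both lists. The negative list is the head of the ascending order, and the positive list is the head of the reversed order.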
Activations Density: 0.031%
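A density of 0.031% means the feature fires on roughly 31 of every 100,000 token positions in the sampled data. Below is a minimal sketch of that calculation. It assumes that "fires" means an activation strictly greater than zero, which is a common but not universal convention. The helper name `activation_density` and the synthetic activations are hypothetical.

```python
import numpy as np

def activation_density(acts, threshold=0.0):
    """Percentage of token positions whose activation exceeds `threshold`.

    Hypothetical helper; the default threshold of 0.0 is an assumption.
    """
    acts = np.asarray(acts)
    return 100.0 * np.count_nonzero(acts > threshold) / acts.size

# Synthetic activations: the feature is active on 31 of 100,000 positions.
acts = np.zeros(100_000)
acts[:31] = 1.0
print(activation_density(acts))
```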