INDEX
Explanations
terms and phrases associated with white supremacy and racist ideologies
Negative Logits
enson     -0.18
olie      -0.16
stro      -0.15
jang      -0.15
rupa      -0.15
Dash      -0.15
pleted    -0.15
Verm      -0.14
rect      -0.14
inent     -0.14
Positive Logits
imizer    0.15
Sadd      0.14
praak     0.14
ì         0.14
Sortable  0.14
vir       0.14
ablish    0.14
?option   0.13
edom      0.13
ître      0.13
Activations Density: 0.048%