INDEX
Explanations
references to extremist groups and hate-related content, especially related to the Ku Klux Klan
references to hate groups and associated terms
Negative Logits
Downloadha   -0.81
ded          -0.77
dra          -0.77
kj           -0.76
*/(          -0.75
neau         -0.75
gob          -0.74
ochond       -0.74
til          -0.73
abilities    -0.72
Positive Logits
Klux         1.23
Klan         1.12
KKK          0.95
Sabha        0.81
affiliation  0.77
affili       0.74
NAACP        0.73
robes        0.73
Beir         0.72
Jr           0.71
Activations Density: 0.011%
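The density figure can be read as the share of tokens on which the feature fires at all. The sketch below shows that reading under stated assumptions: the function name `activation_density`, the zero threshold, and the sample data are all hypothetical, and the dashboard's own definition may differ.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens with a nonzero
    activation. The source does not state how it is computed.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 100,000 tokens, 11 of which activate the feature.
sample = [0.0] * 99_989 + [1.5] * 11
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.011%
```

A density this low would mean the feature fires on roughly one token in nine thousand, which fits a feature tied to a narrow set of terms.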