INDEX
Explanations
explicit mentions of racism
terms related to racism and accusations of racist behavior
Negative Logits
ITNESS      -0.88
pad         -0.83
icular      -0.78
amina       -0.77
earchers    -0.71
ATURE       -0.71
arios       -0.71
Delivery    -0.71
itness      -0.69
stantial    -0.69
Positive Logits
slurs        1.27
prejudice    1.02
bigot        0.96
tir          0.96
slur         0.93
racists      0.92
hatred       0.92
racist       0.91
stereotyp    0.91
stereotypes  0.91
Activations Density: 0.047%