INDEX
Explanations
statements or phrases related to racism, specifically when the term "racist" is mentioned or implied
references to racism and racist behavior
Negative Logits
ITNESS    -0.85
icular    -0.82
pad       -0.81
amina     -0.79
Delivery  -0.72
eenth     -0.70
ATURE     -0.70
marks     -0.70
ieth      -0.70
imen      -0.70
Positive Logits
slurs        1.18
prejudice    0.94
stereotypes  0.93
stereotyp    0.91
slur         0.91
stereotype   0.89
tir          0.86
racist       0.80
racists      0.80
bigot        0.80
Activations Density 0.050%