INDEX
Explanations
mentions or discussions of racist behaviors or beliefs
occurrences and discussions of racism
Negative Logits
hower      -0.81
Delivery   -0.81
pad        -0.80
amina      -0.80
icular     -0.80
RH         -0.79
tis        -0.78
avez       -0.78
ership     -0.77
weeney     -0.76
Positive Logits
slurs         1.15
racist        0.95
stereotyp     0.91
nationalist   0.90
racists       0.90
prejudice     0.89
stereotypes   0.88
caric         0.86
tir           0.86
sexist        0.86
Activations Density 0.029%