INDEX
Explanations
words related to racism
references to racism and related accusations
Negative Logits
tis       -0.96
trak      -0.88
amina     -0.86
oning     -0.84
ITNESS    -0.79
rolog     -0.78
RH        -0.78
icular    -0.77
aple      -0.76
aver      -0.75
Positive Logits
racist         1.18
racists        1.14
slurs          1.03
racism         0.97
homophobic     0.96
nationalist    0.95
sexist         0.93
stereotypes    0.92
caric          0.91
supremacist    0.90
Activations Density 0.014%