Explanations
mentions of racial issues or concepts in various contexts
references to racial issues and profiling
Negative Logits
uden      -0.90
icular    -0.87
ertodd    -0.86
tower     -0.85
hower     -0.79
erva      -0.78
dra       -0.78
arent     -0.77
rov       -0.77
etsk      -0.76
Positive Logits
slurs             1.15
ized              1.00
minorities        0.98
profiling         0.94
violence          0.93
discrimination    0.91
supremacists      0.90
affili            0.89
stereotypes       0.88
Equality          0.88
Activations Density 0.013%