INDEX
Explanations
words related to racial issues
references to racial bias and disparities
Negative Logits
uden      -0.91
icular    -0.87
tower     -0.86
ertodd    -0.85
erva      -0.80
dra       -0.79
hower     -0.78
kens      -0.76
ATURE     -0.76
arent     -0.75
Positive Logits
slurs          1.14
ized           0.99
minorities     0.95
violence       0.91
ethnic         0.91
profiling      0.89
affili         0.88
purity         0.87
animosity      0.86
backgrounds    0.85
Activations Density: 0.015%