INDEX
Explanations
words related to racial bias and discrimination
references to racial issues and biases
Negative Logits
uden      -0.95
tower     -0.91
ertodd    -0.82
icular    -0.82
erva      -0.81
hower     -0.80
dra       -0.79
kens      -0.75
stadt     -0.75
arent     -0.74
Positive Logits
slurs         1.18
ized          1.02
affili        0.95
minorities    0.95
purity        0.93
violence      0.92
prejudice     0.91
profiling     0.91
animosity     0.91
tensions      0.90
Activations Density: 0.015%