INDEX
Explanations
terms related to racial bias or discrimination
topics related to race and racial issues
Negative Logits
Token     Logit
uden      -0.97
icular    -0.83
tower     -0.81
20439     -0.80
ertodd    -0.78
amina     -0.76
rov       -0.76
dra       -0.76
debian    -0.75
OHN       -0.74
Positive Logits
Token           Logit
slurs           1.16
ized            0.99
minorities      0.98
profiling       0.95
racial          0.95
caste           0.94
violence        0.93
stereotypes     0.93
discrimination  0.92
affili          0.91
Activations Density: 0.015%