INDEX
Explanations
terms related to racial issues and discrimination
references to racial issues and discrimination
Negative Logits
uden      -0.98
amina     -0.82
tower     -0.81
oning     -0.80
rov       -0.80
icular    -0.78
20439     -0.78
dra       -0.77
ertodd    -0.76
debian    -0.76
Positive Logits
slurs            1.08
minorities       0.99
racial           0.97
ized             0.93
caste            0.93
profiling        0.89
affili           0.88
backgrounds      0.86
discrimination   0.84
stereotypes      0.84
Activations Density: 0.015%