INDEX
Explanations
mentions of racial issues or disparities
terms related to racial issues and discrimination
Negative Logits
ertodd    -0.90
tower     -0.80
kens      -0.78
ipop      -0.78
20439     -0.78
amina     -0.75
stadt     -0.75
icular    -0.75
hran      -0.74
uden      -0.73
Positive Logits
slurs             1.26
ized              1.10
minorities        1.02
prejudice         1.01
disparities       0.99
profiling         0.98
discrimination    0.98
disparity         0.95
animosity         0.94
stereotypes       0.94
Activations Density 0.035%