INDEX
Explanations
mentions of discrimination or inequality against minorities
references to marginalized and minority groups
Negative Logits
ENA      -0.82
FIN      -0.82
CHA      -0.76
amina    -0.76
rol      -0.74
×ŀ       -0.73
ר        -0.72
PT       -0.72
rolog    -0.71
×        -0.71
Positive Logits
minorities     1.08
genders        0.99
rats           0.89
minority       0.83
backgrounds    0.80
eatures        0.79
ecided         0.77
whites         0.77
unemploy       0.75
males          0.75
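
Lists like the two above are commonly produced by projecting a feature's output direction through the model's unembedding matrix and ranking tokens by the resulting score. The page does not say how its values were computed, so the following is only a minimal sketch under that assumption. The names `logit_effects` and `top_and_bottom`, the four-token vocabulary, and all numbers are hypothetical toy values, not data from this feature.

```python
def logit_effects(decoder_direction, unembedding, vocab):
    """Map each token to the dot product of the feature direction
    with that token's unembedding vector (assumed scoring method)."""
    effects = {}
    for token, column in zip(vocab, unembedding):
        effects[token] = sum(d * w for d, w in zip(decoder_direction, column))
    return effects


def top_and_bottom(effects, k):
    """Return the k most positive and the k most negative tokens."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1], reverse=True)
    positive = ranked[:k]
    negative = ranked[-k:][::-1]  # most negative first
    return positive, negative


# Toy inputs (hypothetical): a 2-dimensional feature direction and
# one unembedding vector per token.
vocab = ["a", "b", "c", "d"]
unembedding = [[2.0, 0.0], [0.5, 1.0], [-1.0, 0.0], [-3.0, 1.0]]
direction = [1.0, 0.5]

effects = logit_effects(direction, unembedding, vocab)
positive, negative = top_and_bottom(effects, k=2)
print(positive)  # [('a', 2.0), ('b', 1.0)]
print(negative)  # [('d', -2.5), ('c', -1.0)]
```

In a real model the direction and unembedding vectors have thousands of dimensions, and the ranking runs over the full vocabulary.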
Activations Density 0.005%
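
An activation density is usually read as the fraction of sampled tokens on which the feature fires at all. A minimal sketch of that arithmetic, assuming "fires" means an activation strictly above zero; the function name `activation_density`, the threshold, and the sample below are illustrative assumptions, not taken from this page:

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds the threshold
    (assumed definition of density)."""
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: the feature fires on 1 token out of 20,000.
sample = [0.0] * 19_999 + [3.2]
density = activation_density(sample)
print(f"{density:.3%}")  # 0.005%
```

At a density of 0.005%, the feature would be active on roughly one token in every 20,000.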