Explanations
statements related to unequal treatment based on race, gender, or sexual orientation
references to discrimination and unfair treatment of marginalized groups
Negative Logits
pex       -0.69
Knot      -0.68
helium    -0.67
leak      -0.61
Sync      -0.61
ellipt    -0.60
bluff     -0.60
logo      -0.59
Oper      -0.58
Tycoon    -0.57
Positive Logits
discrimination    0.98
ardless           0.95
reatment          0.94
discriminated     0.90
due               0.89
unfairly          0.85
afforded          0.84
academ            0.83
unjust            0.82
irrespective      0.82
Activations Density 0.322%