Explanations
phrases and words that describe disparities, inequalities, or imbalances
instances of disparity or disproportionate impact on various groups or issues
Negative Logits
Token               Logit
uring               -0.80
ince                -0.74
erm                 -0.73
love                -0.72
adal                -0.71
shire               -0.70
ures                -0.70
zyme                -0.69
psons               -0.69
DCS                 -0.68
Positive Logits
Token               Logit
disproportionately   0.93
disproportion        0.92
disadvantage         0.84
disadvant            0.79
favoring             0.78
disenfranch          0.77
disadvantages        0.74
representation       0.72
aga                  0.70
benefiting           0.69
Activations Density: 0.049%