INDEX
Explanations
phrases or words that indicate something being disproportionately more or less than expected
discussions of disproportionate impacts or effects on various groups
Negative Logits
Token   Logit
ince    -0.77
ht      -0.74
ired    -0.72
uring   -0.72
held    -0.71
icist   -0.70
PT      -0.70
adal    -0.70
love    -0.70
ain     -0.69
Positive Logits
Token               Logit
disproportionately  1.07
disproportion       1.04
サーティワン        0.83
disadvant           0.81
impacts             0.81
ゼウス              0.80
disadvantages       0.79
proport             0.78
adolesc             0.78
shenan              0.76
Activations Density 0.014%
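
Token lists like the one above sometimes show non-ASCII tokens in a byte-level form, for example `ãĤ¼ãĤ¦ãĤ¹` in place of `ゼウス`. The sketch below shows one way to recover the readable text, assuming the model uses a GPT-2 style byte-level BPE vocabulary (the page itself does not state which tokenizer is used). The function names `bytes_to_unicode` and `decode_token` are illustrative and not part of any dashboard API.

```python
def bytes_to_unicode():
    """Build the GPT-2 style table mapping each byte (0..255) to one printable character.

    Assumption: the tokenizer follows the GPT-2 byte-level BPE convention.
    """
    # Bytes that are already printable map to themselves.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Remaining bytes are assigned code points starting at 256, in byte order.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


# Inverse table: displayed character -> original byte value.
CHAR_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(display: str) -> str:
    """Turn a byte-level token string back into ordinary UTF-8 text."""
    raw = bytes(CHAR_TO_BYTE[ch] for ch in display)
    # Tokens can end mid-character, so replace undecodable bytes instead of failing.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for shown in ["ãĤµãĥ¼ãĥĨãĤ£ãĥ¯ãĥ³", "ãĤ¼ãĤ¦ãĤ¹", "disproportionately"]:
        print(shown, "->", decode_token(shown))
```

Plain ASCII tokens such as `disproportionately` pass through unchanged, so the same function can be applied to every row of a logit table.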