Explanations
terms related to discrimination and bias
Negative Logits
ioned      -0.21
cn         -0.16
ittings    -0.15
loy        -0.15
pij        -0.14
anh        -0.14
orman      -0.14
jar        -0.14
ends       -0.14
oral       -0.14
Positive Logits
Against    0.22
against    0.22
based      0.21
Against    0.20
against    0.20
Based      0.18
Based      0.16
272        0.16
Discrim    0.16
inating    0.15
Activations Density 0.017%