INDEX
Explanations
phrases related to discrimination and bias, particularly based on gender and race
Negative Logits
anax         -0.16
corner       -0.15
ÅĻi          -0.15
linkplain    -0.15
nej          -0.15
iej          -0.15
AMESPACE     -0.14
iesel        -0.14
tslib        -0.14
/Foundation  -0.14
Positive Logits
race         0.32
age          0.30
gender       0.27
Race         0.25
Race         0.25
race         0.24
Age          0.23
ethnicity    0.23
age          0.23
sex          0.22
Activations Density: 0.190%