INDEX
Explanations
concepts related to discrimination and bias based on identity characteristics
Negative Logits
linkplain        -0.16
ÙĪÙĦات           -0.16
grily            -0.16
oles             -0.15
oise             -0.15
ModelIndex       -0.15
QualifiedName    -0.15
á»ĥn             -0.14
eselect          -0.14
Ñĥмов            -0.14
Positive Logits
race             0.33
race             0.26
gender           0.25
Race             0.24
Race             0.23
nationality      0.23
sex              0.22
status           0.22
religion         0.22
age              0.21
Activations Density: 0.086%