Explanations
based on protected characteristics
Negative Logits
Token       Logit
ingle       -0.13
Weston      -0.09
kie         -0.09
icana       -0.09
ãİ          -0.08
Grim        -0.08
impartial   -0.08
iset        -0.08
Hers        -0.08
TextEdit    -0.08
Positive Logits
Token       Logit
race        0.19
race        0.15
Race        0.13
grounds     0.13
their       0.13
gender      0.12
protected   0.12
skin        0.12
grounds     0.12
Race        0.12
Activations Density 0.038%