Negative logits
Token        Logit
Gender       0.59
unfair       0.54
gender       0.52
Gender       0.52
Bias         0.51
inclus       0.49
biasing      0.48
bias         0.47
injust       0.47
Inclusive    0.45
Positive logits
Token        Logit
racial       1.64
interracial  1.45
racially     1.43
racial       1.41
racist       1.39
Racial       1.36
racism       1.27
rac          1.12
Racism       1.09
white        1.05
Activations density: 0.033%