Explanations
reinforce harmful societal biases
Negative Logits
voluptate    0.75
consom    0.72
耗    0.69
aturamento    0.69
Tark    0.68
ইনিংস    0.68
ভিশ    0.67
Timelapse    0.67
hadronic    0.67
ផ្គ    0.67
Positive Logits
discrimination    3.24
discriminatory    2.98
Discrimination    2.86
prejudice    2.81
bias    2.64
discriminate    2.63
discrimin    2.59
discrim    2.58
racism    2.50
discriminated    2.49
Activations Density: 0.609%