INDEX
Explanations
derogatory terms or insults directed towards individuals
derogatory terms and insults aimed at individuals or groups
Negative Logits
Token           Logit
conclud         -0.72
Printed         -0.71
survives        -0.71
Surviv          -0.70
foreseen        -0.68
ortality        -0.67
traject         -0.67
ORDER           -0.66
igsaw           -0.64
longitudinal    -0.63
Positive Logits
Token           Logit
liar            1.03
coward          0.93
irresponsible   0.92
traitor         0.91
unfit           0.90
hypocr          0.89
unworthy        0.87
disgrace        0.84
disrespectful   0.82
insensitive     0.81
Activations Density 0.387%
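
The page does not define the density figure. A common reading is the fraction of sampled token positions on which the feature fires at all. The sketch below illustrates that reading only; the function name `activation_density`, the `threshold` parameter, and the toy data are hypothetical and not taken from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    feature is active, which is not confirmed by the source page.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: 1000 token positions, 4 of them active.
toy = [0.0] * 996 + [0.5, 1.2, 0.3, 2.0]
print(f"{activation_density(toy):.3%}")  # prints 0.400%
```

Under this reading, a density of 0.387% would mean the feature is active on roughly 4 out of every 1000 sampled tokens.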