INDEX
Explanations
derogatory language and discriminatory remarks toward individuals or groups
derogatory language and offensive comments
Negative Logits
Token    Logit
Luck     -0.75
ELY      -0.70
oglu     -0.70
arten    -0.69
Luck     -0.69
INC      -0.68
ellect   -0.67
ederal   -0.66
Mechan   -0.65
UNCH     -0.65
Positive Logits
Token            Logit
slurs            1.09
lewd             1.07
uttered          0.98
derogatory       0.97
harassing        0.97
insulting        0.96
inappropriately  0.96
inappropriate    0.94
indecent         0.93
homophobic       0.92
Activations Density: 0.340%