Explanations
references to defamation and abusive language
words or phrases signaling abusive, insulting, derogatory, offensive, or otherwise disparaging language.
Negative Logits
ValueStyle    -0.79
+#+#          -0.72
DockStyle     -0.70
Wicidata      -0.70
nakalista     -0.63
__))          -0.57
__":          -0.56
ofern         -0.56
–,            -0.56
SOUNDBITE     -0.56
Positive Logits
degrading     0.71
defamation    0.67
slander       0.63
insults       0.63
dispar        0.62
insulting     0.62
attacks       0.61
insult        0.60
hurtful       0.60
targeting     0.58
Activations Density 0.282%