INDEX
    Explanations

    references to defamation and abusive language

    words or phrases signaling abusive, insulting, derogatory, offensive, or otherwise disparaging language.

    Negative Logits
    ValueStyle     -0.79
    +#+#           -0.72
    DockStyle      -0.70
     Wicidata      -0.70
     nakalista     -0.63
    __))           -0.57
    __":           -0.56
    ofern          -0.56
     –,            -0.56
    SOUNDBITE      -0.56

    Positive Logits
     degrading      0.71
     defamation     0.67
     slander        0.63
     insults        0.63
     dispar         0.62
     insulting      0.62
     attacks        0.61
     insult         0.60
     hurtful        0.60
     targeting      0.58
    Act Density 0.282%

    No Known Activations