    Explanations

    negative and aggressive language, including death threats and hate-filled messages

    Negative Logits
    " shenan"        -1.03
    " hairc"         -1.02
    " juges"         -1.01
    " ecru"          -1.01
    " négociations"  -0.98
    "<bos>"          -0.95
    " plais"         -0.94
    " récompenses"   -0.93
    " réunions"      -0.93
    " vœux"          -0.93
    Positive Logits
    " unexpected"     0.61
    " discussions"    0.61
    " occasional"     0.59
    "eclamp"          0.58
    " Palembang"      0.57
    "wareness"        0.57
    "intenance"       0.56
    " caña"           0.56
    " heridos"        0.56
    " prayers"        0.55
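    Lists like the two above are commonly produced by projecting the feature's decoder direction onto the model's unembedding matrix and keeping the most negative and most positive entries. The sketch below shows that idea in pure Python. It is an assumption about how this dashboard computed its numbers, not a confirmed description. The names `w_dec`, `w_u`, `logit_effects` and `top_and_bottom` are hypothetical, and the weights are toy values, not real model weights.

```python
from __future__ import annotations

from typing import Sequence


def logit_effects(w_dec: Sequence[float], w_u: Sequence[Sequence[float]]) -> list[float]:
    """Project a (hypothetical) feature decoder vector onto every vocab column of W_U.

    w_dec: length-d_model decoder direction for one feature.
    w_u:   d_model x vocab unembedding matrix, stored as a list of rows.
    """
    vocab_size = len(w_u[0])
    return [
        sum(w_dec[i] * w_u[i][t] for i in range(len(w_dec)))
        for t in range(vocab_size)
    ]


def top_and_bottom(tokens: Sequence[str], effects: Sequence[float], k: int):
    """Return the k most negative and the k most positive (token, effect) pairs."""
    pairs = sorted(zip(tokens, effects), key=lambda p: p[1])
    negative = pairs[:k]        # smallest effects first
    positive = pairs[::-1][:k]  # largest effects first
    return negative, positive


if __name__ == "__main__":
    # Toy data: 2-dimensional residual stream, 4-token vocabulary (made-up values).
    tokens = [" prayers", " juges", "<bos>", " unexpected"]
    w_dec = [1.0, -0.5]
    w_u = [
        [0.5, -1.0, -0.8, 0.6],
        [0.0, 0.2, 0.3, 0.0],
    ]
    effects = logit_effects(w_dec, w_u)
    neg, pos = top_and_bottom(tokens, effects, k=2)
    print("Negative:", neg)
    print("Positive:", pos)
```

    The leading space inside tokens such as " juges" is kept because it is part of the token string itself.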
    Activation Density 0.566%
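    Activation density is usually reported as the percentage of sampled tokens on which the feature's activation is above zero. At 0.566%, that works out to about 566 firing tokens per 100,000. The sketch below assumes that definition, with a threshold of 0. The function name `activation_density` and the toy activations are hypothetical, and the dashboard may use a different sample or threshold.

```python
from typing import Iterable


def activation_density(activations: Iterable[float], threshold: float = 0.0) -> float:
    """Percentage of tokens whose feature activation exceeds `threshold`.

    Assumes one activation value per token; with the default threshold of 0,
    any strictly positive activation counts as the feature "firing".
    """
    total = 0
    firing = 0
    for a in activations:
        total += 1
        if a > threshold:
            firing += 1
    if total == 0:
        raise ValueError("need at least one token activation")
    return 100.0 * firing / total


if __name__ == "__main__":
    # Toy example: 566 firing tokens out of 100,000 gives 0.566%.
    acts = [1.0] * 566 + [0.0] * (100_000 - 566)
    print(f"Activation Density {activation_density(acts):.3f}%")
```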

    No Known Activations