    Explanations

    instances of abusive or threatening behavior

    Offensive or abusive language

    NEGATIVE LOGITS
    Token             Logit
    " 发表于"          -0.58
    " parad"          -0.57
    "__':\n"          -0.57
    "coltà"           -0.56
    " Voluntary"      -0.53
    "[]\n"            -0.52
    " preved"         -0.51
    " Italijanski"    -0.51
    " panik"          -0.51
    "ύπ"              -0.51

    POSITIVE LOGITS
    Token             Logit
    " insults"         1.30
    " harassment"      1.20
    " insulting"       1.17
    " insult"          1.13
    " bullying"        1.06
    " insulted"        1.05
    " harassing"       0.96
    " tau"             0.95
    " threats"         0.94
    " attacks"         0.94

    Activation Density: 0.465%

    No Known Activations