INDEX
    Explanations

    instances of verbal abuse or harassment

    Negative Logits
    Token        Logit
    "arent"      -0.85
    "roxy"       -0.83
    "bard"       -0.82
    "akeru"      -0.81
    "enthal"     -0.81
    "alach"      -0.79
    "avorite"    -0.79
    "xon"        -0.78
    "ktop"       -0.75
    "uden"       -0.74
    Positive Logits
    Token               Logit
    " altercation"      1.04
    "ized"              0.98
    " verbal"           0.97
    "isations"          0.96
    "izing"             0.92
    "izations"          0.92
    "ization"           0.91
    " communication"    0.89
    " abuse"            0.89
    " spar"             0.86
    Activation Density: 0.007%

    No Known Activations