    Explanations

    negative or offensive language and comments

    references to offensive comments and remarks

    Negative Logits
    Token       Logit
    "oglu"      -0.74
    "negie"     -0.73
    "prus"      -0.72
    "sonian"    -0.71
    "Luck"      -0.69
    "UNCH"      -0.69
    "iets"      -0.69
    "inav"      -0.69
    "aer"       -0.68
    " Luck"     -0.68
    Positive Logits
    Token                 Logit
    " inappropriate"      1.13
    " inappropriately"    1.13
    " slurs"              1.09
    " lewd"               1.05
    " disrespectful"      1.04
    " indecent"           1.02
    " abusive"            1.01
    " harassing"          1.00
    " misogyn"            0.95
    " uttered"            0.94
    Activation Density: 0.349%

    No Known Activations