INDEX
    Explanations

    words related to negative behavior such as abusive, rude, derogatory, and hateful

    language related to abusive or harmful behavior

    Negative Logits
    oleon             -0.93
    obyl              -0.92
    zzo               -0.88
    DragonMagazine    -0.85
    igham             -0.85
    Downloadha        -0.84
    zig               -0.83
    ortal             -0.83
    ariat             -0.82
    akeru             -0.82
    Positive Logits
     behav            1.06
     behaviour        0.94
     abusive          0.86
     behavior         0.85
    soever            0.83
     alien            0.82
     distractions     0.80
     aspects          0.80
     undermin         0.79
     slurs            0.78
    Activation Density: 0.032%

    No Known Activations