    Explanations

    offensive language and insults

    derogatory terms and insults directed at individuals

    Negative Logits
        "HCR"       -0.79
        " srf"      -0.74
        " rall"     -0.71
        "BLIC"      -0.71
        "ITED"      -0.71
        "ONT"       -0.70
        "ctica"     -0.70
        "clerosis"  -0.70
        " Pradesh"  -0.70
        "isman"     -0.69

    Positive Logits
        " bitch"    0.83
        "buster"    0.82
        "posts"     0.81
        "iness"     0.78
        "post"      0.77
        "fest"      0.76
        "ings"      0.75
        "dump"      0.74
        "umin"      0.73
        "enger"     0.72

    Activation Density: 0.014%

    No Known Activations