INDEX
    Explanations

    offensive language

    derogatory terms directed at individuals in discussions about social issues

    NEGATIVE LOGITS
    ļéĨĴ
    -0.87
    »Ĵ
    -0.77
    tions
    -0.73
     streamlined
    -0.67
     exting
    -0.66
     srf
    -0.65
    ActionCode
    -0.63
    Reviewer
    -0.63
    ă
    -0.61
    ü
    -0.61
    POSITIVE LOGITS
     congratulations
    0.73
     sorry
    0.71
     surely
    0.67
     kidding
    0.67
     damned
    0.67
    sorry
    0.66
    Wr
    0.66
     Well
    0.65
     Wrong
    0.64
     liar
    0.62
    Act Density 0.326%

    No Known Activations