    Explanations

    negative social interactions related to controversy and offensive behavior

    language related to insults and derogatory remarks

    Negative Logits
    tnc                 -0.75
     Instit             -0.73
    profits             -0.73
    uph                 -0.69
    doi                 -0.69
    natureconservancy   -0.69
    Effect              -0.68
     Architects         -0.67
    nav                 -0.67
    oha                 -0.67
    Positive Logits
     slurs              1.72
     insults            1.66
     derogatory         1.52
     homophobic         1.51
     vulgar             1.42
     racist             1.37
     sexist             1.36
     abusive            1.36
     sarcastic          1.34
     hateful            1.33
    Activation Density: 0.358%

    No Known Activations