INDEX
    Explanations

    negative statements or controversies related to public figures

    derogatory terms and phrases related to social issues and public figures

    Negative Logits
     prepar       -0.70
     Incre        -0.63
    igree         -0.62
    vantage       -0.61
    eworks        -0.59
    yrinth        -0.59
    rieve         -0.59
    accompan      -0.59
    ilitation     -0.58
     synerg       -0.58
    Positive Logits
     sexist        1.26
     racist        1.19
     homophobic    1.18
     misogyny      1.18
     misogyn       1.17
     racists       1.14
     sexism        1.10
     feminists     1.10
     homophobia    1.10
     slurs         1.09
    Activation Density: 1.010%

    No Known Activations