INDEX
    Explanations

    words related to negative or harmful behavior or actions, such as abusive, deceptive, and oppressive

    terms associated with abusive or harmful behaviors and practices

    Negative Logits
    Token               Logit
    "ild"               -0.92
    "igating"           -0.87
    "inen"              -0.86
    "igated"            -0.85
    "ighed"             -0.85
    "igate"             -0.84
    "osal"              -0.83
    "oleon"             -0.83
    "izen"              -0.82
    "Downloadha"        -0.80

    Positive Logits
    Token               Logit
    " abusive"          1.26
    " citiz"            0.88
    " behav"            0.83
    " behaviour"        0.83
    " oppressive"       0.81
    " undermin"         0.78
    "volent"            0.77
    " tendencies"       0.77
    " discriminatory"   0.77
    " minded"           0.75

    Activation Density: 0.021%

    No Known Activations