    Explanations

    features or characteristics associated with individuals or groups

    words or phrases related to labels and stereotypes

    Negative Logits
    large      -0.81
    usable     -0.74
    ij         -0.73
    range      -0.73
    oho        -0.71
    english    -0.70
    ecause     -0.69
    ensive     -0.69
    CLOSE      -0.68
    angan      -0.68
    Positive Logits
     extraord   1.18
    gery        1.05
    esses       1.04
    hood        1.03
    doms        1.01
    ry          0.95
     hordes     0.94
    dom         0.92
    isms        0.91
    ism         0.90
    Activation Density: 0.350%

    No Known Activations