INDEX
    Explanations

    characteristics related to human identity or personal attributes like race, ethnicity, religion, sexual orientation, nationality, and physical features

    terms related to identity and discrimination based on various characteristics

    Negative Logits
    Token         Logit
    "EMS"         -0.76
    "imar"        -0.76
    " Oper"       -0.75
    "ERG"         -0.69
    "INTER"       -0.67
    "vae"         -0.66
    "aug"         -0.66
    "ald"         -0.66
    "Dispatch"    -0.65
    " Rak"        -0.65
    Positive Logits
    Token                Logit
    " ancestry"          0.88
    " affiliation"       0.86
    " backgrounds"       0.85
    " ethnicity"         0.81
    " coloring"          0.77
    " discrimination"    0.77
    " prejudice"         0.76
    " affili"            0.76
    " preference"        0.76
    " stripe"            0.74
    Act Density 0.096%

    No Known Activations