INDEX
    Explanations

    mentions of people of color

    references to marginalized groups, specifically people of color

    Negative Logits
    Xi        -0.78
     Niet     -0.72
    ãĤ´       -0.72
    WAR       -0.71
     Nex      -0.69
    ertodd    -0.68
    ERG       -0.68
     Sut      -0.68
    sg        -0.67
    chn       -0.67
    Positive Logits
    blind         0.85
    anguage       0.83
     minorities   0.73
     coded        0.69
     backgrounds  0.69
     queer        0.68
     color        0.67
     slurs        0.67
     stripes      0.67
    ="#           0.65
    Act Density 0.012%

    No Known Activations