INDEX
    Explanations

    words related to racial bias and discrimination

    references to racial issues and biases

    Negative Logits
    "uden"      -0.95
    "tower"     -0.91
    "ertodd"    -0.82
    "icular"    -0.82
    "erva"      -0.81
    "hower"     -0.80
    "dra"       -0.79
    "kens"      -0.75
    "stadt"     -0.75
    "arent"     -0.74
    Positive Logits
    " slurs"        1.18
    "ized"          1.02
    " affili"       0.95
    " minorities"   0.95
    " purity"       0.93
    " violence"     0.92
    " prejudice"    0.91
    " profiling"    0.91
    " animosity"    0.91
    " tensions"     0.90
    Activation Density: 0.015%

    No Known Activations