INDEX
    Explanations

    words related to racial issues

    references to racial bias and disparities

    Negative Logits
    Token        Logit
    "uden"       -0.91
    "icular"     -0.87
    "tower"      -0.86
    "ertodd"     -0.85
    "erva"       -0.80
    "dra"        -0.79
    "hower"      -0.78
    "kens"       -0.76
    "ATURE"      -0.76
    "arent"      -0.75
    Positive Logits
    Token             Logit
    " slurs"          1.14
    "ized"            0.99
    " minorities"     0.95
    " violence"       0.91
    "ethnic"          0.91
    " profiling"      0.89
    " affili"         0.88
    " purity"         0.87
    " animosity"      0.86
    " backgrounds"    0.85
    Activation Density: 0.015%

    No Known Activations