    Explanations

    Racial stereotypes

    Negative Logits
    Token            Logit
    ".WARNING"       -0.07
    " Kaf"           -0.06
                     -0.06
    "ATABASE"        -0.06
    " يكون"          -0.06
    " воздейств"     -0.06
    "лок"            -0.06
    " injust"        -0.06
    "mh"             -0.06
    "/div"           -0.06
    Positive Logits
    Token            Logit
    " Sweep"         0.07
    "Quote"          0.07
    "Despite"        0.06
    " battery"       0.06
    "oops"           0.06
    " Marlins"       0.06
    " forKey"        0.06
    "filters"        0.06
    " epic"          0.06
    "수로"           0.06
    Activation Density: 0.004%

    No Known Activations