INDEX
    Explanations

    phrases related to discrimination and bias, particularly based on gender and race

    Negative Logits
    Token            Logit
    "anax"           -0.16
    "corner"         -0.15
    "ÅĻi"            -0.15
    "linkplain"      -0.15
    "nej"            -0.15
    "iej"            -0.15
    "AMESPACE"       -0.14
    "iesel"          -0.14
    " tslib"         -0.14
    "/Foundation"    -0.14
    Positive Logits
    Token            Logit
    " race"           0.32
    " age"            0.30
    " gender"         0.27
    " Race"           0.25
    "Race"            0.25
    "race"            0.24
    " Age"            0.23
    " ethnicity"      0.23
    "age"             0.23
    " sex"            0.22
    Activation Density: 0.190%

    No Known Activations