    Explanations

    based on protected characteristics

    Negative Logits
    Token         Logit
    "ingle"       -0.13
    " Weston"     -0.09
    "kie"         -0.09
    "icana"       -0.09
    "ãİ"          -0.08
    " Grim"       -0.08
    " impartial"  -0.08
    "iset"        -0.08
    " Hers"       -0.08
    "TextEdit"    -0.08
    Positive Logits
    Token         Logit
    " race"       0.19
    "race"        0.15
    " Race"       0.13
    " grounds"    0.13
    " their"      0.13
    " gender"     0.12
    " protected"  0.12
    " skin"       0.12
    "grounds"     0.12
    "Race"        0.12
    Activation Density: 0.038%

    No Known Activations