INDEX
    Explanations

    terms and concepts related to discrimination and bias

    Negative Logits
    Token      Logit
    "lify"     -0.16
    "liers"    -0.15
    "aho"      -0.15
    "osphere"  -0.14
    "appa"     -0.14
    "ifier"    -0.14
    "comings"  -0.14
    "íĮĮ"      -0.14
    "rades"    -0.14
    "íĴĪ"      -0.14
    Positive Logits
    Token           Logit
    " against"      0.28
    " toward"       0.27
    " towards"      0.26
    " based"        0.25
    " Against"      0.24
    "against"       0.23
    "Against"       0.23
    " experienced"  0.22
    " Towards"      0.22
    " Based"        0.20
    Activation Density: 0.040%

    No Known Activations