INDEX
    Explanations

    terms related to discrimination and bias

    NEGATIVE LOGITS
    Token         Logit
    "ioned"       -0.21
    "cn"          -0.16
    "ittings"     -0.15
    "loy"         -0.15
    " pij"        -0.14
    "anh"         -0.14
    "orman"       -0.14
    "jar"         -0.14
    "ends"        -0.14
    "oral"        -0.14
    POSITIVE LOGITS
    Token         Logit
    "Against"     0.22
    " against"    0.22
    " based"      0.21
    " Against"    0.20
    "against"     0.20
    " Based"      0.18
    "Based"       0.16
    "272"         0.16
    " Discrim"    0.16
    "inating"     0.15
    Activation Density: 0.017%

    No Known Activations