INDEX
    Explanations

    equality and fairness

    Negative Logits
    Token         Logit
    "-door"       -0.08
    "-code"       -0.08
    " calific"    -0.08
    " घोषणा"      -0.08
    "र्जी"         -0.08
    "роб"         -0.07
    ">Error"      -0.07
    "dish"        -0.07
    " 바로"       -0.07
    " Door"       -0.07
    Positive Logits
    Token           Logit
    "公平"          0.13
    " fairness"     0.13
    " evenly"       0.12
    " equitable"    0.12
    " equally"      0.11
    " balanced"     0.10
    "Fair"          0.10
    " unbiased"     0.09
    " fair"         0.09
    "Balanced"      0.09
    Act Density 0.019%

    No Known Activations