INDEX
    Explanations

    The neuron activates on words naming protected demographic characteristics (e.g. race, gender, age, religion, ethnicity).

    Negative Logits
    Token               Logit
    "NSError"           -0.06
    " эксп"             -0.06
    " qint"             -0.06
    (not rendered)      -0.06
    "Na"                -0.06
    " subcontract"      -0.06
    " fab"              -0.06
    "openhagen"         -0.06
    " Ernst"            -0.06
    "wort"              -0.06

    Positive Logits
    Token               Logit
    " interracial"      0.09
    " Race"             0.09
    " racially"         0.09
    " racial"           0.09
    "racial"            0.09
    " race"             0.08
    " Religion"         0.07
    "acial"             0.07
    "IAL"               0.07
    "classList"         0.07
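    The positive and negative logit lists rank vocabulary tokens by how strongly the neuron pushes their logits up or down. A common way to produce such lists, and only an assumption here because the page does not state its method, is to multiply the neuron's output weight vector by the model's unembedding matrix and sort the result. The sketch below does that in plain Python; the names `logit_effects`, `top_k`, `w_out`, and `w_u` are hypothetical, and the toy weights are made up, not this neuron's real parameters.

```python
from typing import Sequence

# Assumption: logit lists = w_out @ W_U, where w_out is the neuron's output
# weight vector (length d_model) and W_U is the unembedding matrix
# (d_model rows x vocab_size columns). All names here are hypothetical.


def logit_effects(w_out: Sequence[float], w_u: Sequence[Sequence[float]]) -> list[float]:
    """Return w_out @ W_U: one direct logit effect per vocabulary token."""
    if len(w_out) != len(w_u):
        raise ValueError("w_out length must equal the number of rows in W_U")
    vocab_size = len(w_u[0])
    return [
        sum(w_out[i] * w_u[i][j] for i in range(len(w_out)))
        for j in range(vocab_size)
    ]


def top_k(vocab: Sequence[str], effects: Sequence[float], k: int,
          largest: bool = True) -> list[tuple[str, float]]:
    """Rank tokens by effect; largest=True gives positive logits, False negative ones."""
    ranked = sorted(zip(vocab, effects), key=lambda pair: pair[1], reverse=largest)
    return ranked[:k]


# Toy example with d_model = 2 and a 4-token vocabulary (made-up numbers).
vocab = [" race", " fab", " Religion", "NSError"]
w_out = [1.0, -0.5]
w_u = [
    [0.08, -0.02, 0.05, -0.04],
    [0.00, 0.08, -0.04, 0.06],
]
effects = logit_effects(w_out, w_u)
print(top_k(vocab, effects, 2))                 # tokens the neuron promotes most
print(top_k(vocab, effects, 2, largest=False))  # tokens the neuron suppresses most
```

    Real dashboards may also fold in layer normalization or other scaling before this projection, so the exact values can differ from a raw product like this one.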
    Activation Density: 0.007%
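    A density of 0.007% means the neuron fires on roughly 7 of every 100,000 tokens. The sketch below assumes density is the percentage of tokens in an evaluation corpus whose activation is strictly above a threshold; the function name `activation_density` and the default threshold of 0.0 are hypothetical, not taken from this page.

```python
from typing import Sequence

# Assumption: "activation density" = percentage of tokens on which the
# neuron's activation exceeds a threshold. The 0.0 default is hypothetical.


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the percentage of activations strictly above threshold."""
    if not activations:
        raise ValueError("need at least one activation value")
    fired = sum(1 for a in activations if a > threshold)
    return 100.0 * fired / len(activations)


# A neuron that fires on 7 of 100,000 tokens.
acts = [0.0] * 99_993 + [1.5] * 7
print(f"{activation_density(acts):.3f}%")  # prints 0.007%
```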

    No Known Activations