    Explanations

    The neuron flags words appearing in legal non-discrimination statements, especially terms naming protected categories (e.g. sex, race, disability).

    Negative Logits
    Token       Logit
     shl        -0.07
    -           -0.07
    '&&         -0.07
     srd        -0.06
    DAO         -0.06
    acht        -0.06
    RAL         -0.06
    #####↵      -0.06
     Between    -0.06
    _Al         -0.06
    Positive Logits
    Token       Logit
     alleging   0.06
    _pedido     0.06
    وئ          0.06
    .control    0.06
     Ph         0.06
     توص        0.06
    Ћ           0.06
     ط          0.06
    Boundary    0.06
    -status     0.06
    Activation Density: 0.003%

    No Known Activations