    Explanations

    references to identity and pronouns in the context of gender

    Negative Logits
    "èĥĨ"             -0.07
    "ãĥ¡ãĥ©"          -0.06
    "หล"              -0.06
    "_GAP"            -0.06
    "ška"             -0.06
    "atoi"            -0.06
    "uren"            -0.06
    "arse"            -0.06
    "lien"            -0.06
    "-ignore"         -0.06
    Positive Logits
    "dana"            0.08
    " precision"      0.08
    "avoid"           0.07
    " sensitivity"    0.07
    " avoid"          0.07
    " usage"          0.07
    " sensitive"      0.07
    " respectful"     0.07
    " sensit"         0.07
    ".scalablytyped"  0.07
    Activation Density: 0.007%

    No Known Activations