    Explanations

    references to LGBTQ+ identities and specifically terms related to "gay."

    Negative Logits
    inus       -0.17
    cker       -0.17
    帯         -0.16
    rál        -0.16
    sg         -0.16
    ahy        -0.15
    sk         -0.15
    iams       -0.15
    стин       -0.15
    ейств      -0.15
    Positive Logits
    lord        0.30
    dar         0.26
    atri        0.25
    -rights     0.24
    bor         0.22
    lords       0.21
     rights     0.21
    -friendly   0.20
    ety         0.20
    est         0.20
    Act Density 0.010%

    No Known Activations