INDEX
    Explanations

    references to structures or frameworks that could signify oppression or confinement

    Negative Logits
    adays     -0.17
    owitz     -0.16
    [s        -0.16
    razier    -0.16
    weeney    -0.15
    ettes     -0.14
    ziej      -0.14
    wayne     -0.14
    worthy    -0.14
    (s        -0.14

    Positive Logits
    une       0.19
    ild       0.18
    els       0.18
    ints      0.17
    ils       0.17
    ads       0.17
    unc       0.17
    ips       0.17
    iter      0.16
    icer      0.16
    Activation Density: 0.007%

    No Known Activations