INDEX
    Explanations

    references to the pronoun "it."

    Negative Logits
    ofire     -0.17
    shr       -0.15
    ailles    -0.15
    uzey      -0.15
     shr      -0.15
    oload     -0.14
    pq        -0.14
    anax      -0.14
    eniable   -0.14
    ranÃŃ     -0.14
    Positive Logits
    orden     0.16
    asca      0.16
     cheat    0.15
    bow       0.15
    vet       0.14
    aml       0.14
    erro      0.14
    çͲ        0.14
     Ted      0.14
    ord       0.14
    Act Density 0.213%
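
One plausible reading of the activation density figure is the share of sampled tokens on which the feature fires at all. The sketch below assumes that definition. The function name, the threshold, and the sample data are hypothetical and do not come from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the fraction of tokens with a non-zero
    activation, expressed as a percentage.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical sample: 3 active tokens out of 1000.
sample = [0.0] * 997 + [0.4, 1.2, 0.05]
print(f"Act Density {activation_density(sample):.3f}%")  # prints "Act Density 0.300%"
```

Under this reading, a density of 0.213% would correspond to the feature firing on roughly 2 of every 1000 sampled tokens.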

    No Known Activations