INDEX
    Explanations

    words that refer to pronouns and their usage

    Negative Logits
    Token           Logit
    "liš"           -0.15
    " Secondary"    -0.15
    " Pyramid"      -0.15
    "ter"           -0.15
    "neys"          -0.15
    " doubly"       -0.14
    " Pun"          -0.14
    " secondary"    -0.14
    "ori"           -0.14
    " native"       -0.14
    Positive Logits
    Token           Logit
    " pron"         0.31
    " Pron"         0.25
    "pron"          0.23
    " Singular"     0.20
    "singular"      0.18
    "azen"          0.17
    " Us"           0.17
    " demonstr"     0.17
    "Us"            0.17
    " singular"     0.17
    Activation Density: 0.035%

    No Known Activations