    Explanations

    personal pronouns and expressions of self-identity

    Negative Logits
    Token          Logit
    "wards"        -0.17
    "UBL"          -0.16
    "772"          -0.16
    " Civ"         -0.16
    "ardo"         -0.14
    "ubl"          -0.14
    "uctor"        -0.14
    "ecta"         -0.14
    "atk"          -0.14
    "ält"          -0.14
    Positive Logits
    Token          Logit
    " aside"       0.37
    " into"        0.29
    " Aside"       0.27
    "aside"        0.27
    " together"    0.26
    " INTO"        0.22
    " forward"     0.22
    "Aside"        0.22
    "atively"      0.21
    " Into"        0.21
    Activation Density: 0.050%

    No Known Activations