INDEX
    Explanations

    references to the speaker or first-person perspectives

    Negative Logits
    Token           Logit
    "cliffe"        -0.19
    "nya"           -0.18
    "rous"          -0.17
    "stein"         -0.17
    "so"            -0.17
    "n"             -0.16
    " themselves"   -0.16
    "lass"          -0.16
    " itself"       -0.16
    "l"             -0.16

    Positive Logits
    Token           Logit
    "/us"            0.38
    "SELF"           0.23
    "/her"           0.23
    "adows"          0.22
    "asuring"        0.20
    "zzo"            0.19
    "andering"       0.18
    "zelf"           0.18
    "-même"          0.18
    "ury"            0.17
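
    The page does not say how these logit lists were produced. A common approach, assumed here, is to project the feature's decoder direction through the model's unembedding matrix and read off the most negative and most positive entries. The sketch below shows that computation on toy data; `top_logit_effects`, `w_dec_row`, and `w_unembed` are hypothetical names, and real dashboards may also fold in layer norm or other scaling that this sketch ignores.

```python
import numpy as np


def top_logit_effects(w_dec_row, w_unembed, vocab, k=10):
    """Hypothetical helper: project one feature's decoder direction
    through the unembedding and return the k lowest and k highest
    per-token logit contributions as (token, value) pairs."""
    # w_dec_row: shape (d_model,); w_unembed: shape (d_model, vocab_size)
    effects = w_dec_row @ w_unembed  # one logit contribution per token
    order = np.argsort(effects)  # ascending order of contribution
    negative = [(vocab[i], float(effects[i])) for i in order[:k]]
    positive = [(vocab[i], float(effects[i])) for i in order[::-1][:k]]
    return negative, positive


# Toy example: d_model = 2, vocabulary of 4 placeholder tokens.
vocab = ["tok_a", "tok_b", "tok_c", "tok_d"]
w_dec_row = np.array([1.0, 0.0])
w_unembed = np.array([[0.5, -0.2, 0.1, -0.4],
                      [9.0, 9.0, 9.0, 9.0]])
neg, pos = top_logit_effects(w_dec_row, w_unembed, vocab, k=2)
print("negative:", neg)
print("positive:", pos)
```

    With this toy data the feature only reads the first residual dimension, so the negative list is `tok_d` (-0.4) then `tok_b` (-0.2) and the positive list is `tok_a` (0.5) then `tok_c` (0.1).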
    Activation Density 0.077%
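
    Activation density is usually read as the share of sampled token positions on which the feature fires. The page does not state its threshold or sample corpus, so the sketch below assumes "fires" means an activation strictly above zero; `activation_density` and `as_percent` are hypothetical helpers. Under that assumption, 0.077% corresponds to 77 firing positions out of every 100,000 tokens.

```python
import numpy as np


def activation_density(activations, threshold=0.0):
    """Hypothetical helper: fraction of token positions whose
    activation is strictly greater than `threshold`."""
    acts = np.asarray(activations, dtype=float)
    return float((acts > threshold).mean())


def as_percent(density):
    """Format a fraction as a percentage with three decimals."""
    return f"{density * 100:.3f}%"


# Toy sample: 100,000 positions, the feature fires on 77 of them.
acts = np.zeros(100_000)
acts[:77] = 1.0
density = activation_density(acts)
print(as_percent(density))
```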

    No Known Activations