INDEX
    Explanations

    pronouns and the actions associated with them

    Negative Logits
    lys        -0.17
     thereby   -0.16
    arse       -0.16
    orf        -0.15
    ucu        -0.14
    indy       -0.14
    stag       -0.14
    dal        -0.14
    ibt        -0.14
    beg        -0.13

    Positive Logits
    ê¶ģ        0.15
     alone     0.15
    enthal     0.15
    imits      0.14
    endoza     0.14
    itted      0.14
    _tE        0.14
    IFn        0.14
    _tF        0.14
    cade       0.14

    Activation Density: 0.278%

    No Known Activations