INDEX
    Explanations

    expressions of surprise or realization

    Negative Logits
    Token        Logit
    "apo"        -0.18
    "ãĥ¼ãĥĭ"     -0.17
    "apor"       -0.17
    "icator"     -0.16
    "fik"        -0.15
    "ero"        -0.15
    "ators"      -0.15
    "encer"      -0.15
    "eways"      -0.15
    "_OC"        -0.14

    Positive Logits
    Token        Logit
    "annes"       0.17
    " bother"     0.16
    " snap"       0.16
    " Snap"       0.16
    " yes"        0.15
    "irsch"       0.15
    "rens"        0.15
    "sm"          0.15
    " yeah"       0.14
    "Äįan"        0.14
    Act Density 0.015%

    No Known Activations
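
Two of the tokens in the logit lists, `ãĥ¼ãĥĭ` and `Äįan`, look like the printable stand-ins that byte-level BPE vocabularies use for raw UTF-8 bytes. The sketch below shows one way to recover the underlying text. It assumes a GPT-2-style byte-to-character table; the page does not say which tokenizer produced these tokens, so that is an assumption. The names `bytes_to_unicode` and `decode_token` are illustrative, with the first modeled on the helper of the same name in the GPT-2 reference encoder.

```python
def bytes_to_unicode():
    """Build the byte -> printable-character table used by GPT-2-style
    byte-level BPE (assumed here, not confirmed by the source page)."""
    # Bytes that are already printable map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAD))
        + list(range(0xAE, 0x100))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte is shifted to code points starting at U+0100.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


# Inverse table: printable stand-in character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token: str) -> str:
    """Map a vocabulary string back to the text it encodes."""
    raw = bytearray()
    for ch in token:
        if ch in UNICODE_TO_BYTE:
            raw.append(UNICODE_TO_BYTE[ch])
        else:
            # Characters outside the table (such as a literal space
            # shown by the page) are kept unchanged.
            raw.extend(ch.encode("utf-8"))
    # Partial UTF-8 sequences decode to U+FFFD instead of raising.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(decode_token("ãĥ¼ãĥĭ"))  # ーニ
    print(decode_token("Äįan"))    # čan
    print(decode_token(" bother")) # unchanged
```

Under that assumption, the first token is a pair of katakana characters and the second begins with `č`; plain ASCII tokens pass through unchanged.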