INDEX
    Explanations

    expressions of surprise or amazement

    Negative Logits
    loor        -0.18
    atör        -0.16
    егор        -0.15
    istrov      -0.15
    enschaft    -0.15
    令          -0.15
    avou        -0.15
    blr         -0.14
    ergarten    -0.14
    inel        -0.14
    Positive Logits
    zers        0.27
    zer         0.24
    za          0.22
    zas         0.18
    talk        0.18
    -factor     0.17
     Factor     0.17
    indr        0.17
    /flutter    0.16
     factor     0.16
    Act Density 0.016%

    No Known Activations