INDEX
    Explanations

    phrases indicating expectation or surprise

    Negative Logits
    Token         Logit
    "esub"        -0.18
    "adients"     -0.17
    "ê¼"          -0.15
    "oir"         -0.15
    "gon"         -0.15
    "ghi"         -0.15
    ".opens"      -0.15
    "à¥įदर"       -0.14
    "-League"     -0.14
    "okable"      -0.14
    Positive Logits
    Token         Logit
    " comes"      0.41
    " come"       0.39
    " Come"       0.34
    "come"        0.33
    "comes"       0.32
    "Come"        0.31
    " came"       0.28
    " Comes"      0.28
    "æĿ¥"         0.26
    "ä¾Ĩ"         0.24
    Activation Density: 0.019%

    No Known Activations
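
    A few of the token strings in the logit lists (for example æĿ¥ and ä¾Ĩ) look like byte-level BPE renderings of UTF-8 bytes rather than readable text. The page does not name the tokenizer, so the following is only a sketch under the assumption that the GPT-2 style byte-to-character table applies. The helper name decode_token is hypothetical and exists only for this illustration.

```python
def bytes_to_unicode():
    """Assumed GPT-2 style table: each byte value (0-255) -> one printable character."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    chars = printable[:]
    n = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes are shifted into the range starting at U+0100.
            printable.append(b)
            chars.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(printable, chars)}


CHAR_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token: str) -> str:
    """Hypothetical helper: turn a byte-level token string back into text.

    Characters outside the table (already readable text, or a literal space)
    are passed through as their own UTF-8 bytes.
    """
    raw = b"".join(
        bytes([CHAR_TO_BYTE[ch]]) if ch in CHAR_TO_BYTE else ch.encode("utf-8")
        for ch in token
    )
    # Tokens that hold only part of a multi-byte character cannot be decoded
    # cleanly, so invalid sequences become the replacement character.
    return raw.decode("utf-8", errors="replace")


print(decode_token("æĿ¥"))  # 来
print(decode_token("ä¾Ĩ"))  # 來
```

    Under that assumption, both tokens decode to Chinese characters meaning "come", which is consistent with the other positive-logit tokens. A token such as ê¼ holds only a fragment of a multi-byte character, so it decodes to replacement characters instead of readable text.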