INDEX
    Explanations

    phrases indicating surprise or unexpected realizations

    Negative Logits
    Token           Logit
    "alo"           -0.14
    "lue"           -0.13
    "nock"          -0.13
    "lej"           -0.13
    "eated"         -0.13
    "ÑĮÑİ"          -0.13
    "getti"         -0.12
    "nul"           -0.12
    "/single"       -0.12
    "orang"         -0.12

    Positive Logits
    Token           Logit
    " thought"      0.48
    " expected"     0.43
    "thought"       0.42
    " Thought"      0.42
    "expected"      0.37
    "Thought"       0.36
    " anticipated"  0.34
    " hoped"        0.33
    "Expected"      0.33
    " assumed"      0.33
    Act Density 0.362%
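
    The page does not say how the density figure is computed. A common reading is the fraction of token positions on which the feature's activation is nonzero, and the sketch below assumes that definition. The function name, the threshold parameter, and the sample activations are all hypothetical and are not taken from this dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumed definition: density = (number of active positions) / (total positions).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: one feature's activations over 1000 token positions,
# with only 4 positions active.
sample = [0.0] * 996 + [0.7, 1.2, 0.3, 2.1]
density = activation_density(sample)
print(f"{density:.3%}")
```

    Under this assumed definition, a density of 0.362% would mean the feature fires on roughly 3 or 4 of every 1000 token positions.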

    No Known Activations