    Explanations

    rhetorical questions and expressions of surprise

    Negative Logits
    loth
    -0.16
    lost
    -0.14
     Hey
    -0.14
     Yap
    -0.14
    lak
    -0.14
     Heaven
    -0.14
     ups
    -0.13
    alling
    -0.13
    .HTML
    -0.13
     البل
    -0.13
    Positive Logits
     WRONG
    0.33
     wrong
    0.32
     Wrong
    0.28
    Wrong
    0.28
     incorrect
    0.28
    wrong
    0.27
     ошиб
    0.23
     Incorrect
    0.22
     mistaken
    0.22
    incorrect
    0.21
    Act Density 0.101%

    No Known Activations