INDEX
    Explanations

    the phrase "Instead of" followed by an action that goes against the expected or traditional response

    Negative Logits
    Token              Logit
    "Nap"              -0.68
    "essen"            -0.66
    "erto"             -0.66
    " nonetheless"     -0.65
    "read"             -0.64
    "ENE"              -0.64
    " Palestin"        -0.63
    " nevertheless"    -0.63
    "artisan"          -0.63
    " veter"           -0.62
    Positive Logits
    Token              Logit
    " anything"        0.82
    " bothering"       0.74
    " relying"         0.74
    " ours"            0.73
    " being"           0.72
    " sul"             0.72
    " dwelling"        0.72
    " excuses"         0.70
    " focusing"        0.68
    " outright"        0.67
    Activation Density: 0.033%

    No Known Activations