INDEX
    Explanations

    phrases that contrast different ideas

    Negative Logits
    Token    Logit
    obyl     -0.71
    \":      -0.70
    reen     -0.69
    Ñı       -0.68
    esc      -0.67
    ENG      -0.65
    yn       -0.64
    agan     -0.63
    omore    -0.62
    avour    -0.62
    Positive Logits
    Token           Logit
     nevertheless   1.16
     hey            1.06
     nonetheless    1.04
     alas           0.98
    tons            0.90
     fortunately    0.82
     surely         0.80
     suffice        0.79
     luckily        0.75
     damn           0.74
    Act Density 0.146%

    No Known Activations