INDEX
    Explanations

    contradictions or nuances in statements

    Negative Logits
     but
    -0.22
     maar
    -0.17
    but
    -0.17
     αλλά
    -0.16
     sice
    -0.16
     mais
    -0.16
    だが
    -0.15
    edb
    -0.15
    ですが
    -0.15
     btw
    -0.15
    Positive Logits
     nevertheless
    0.74
     nonetheless
    0.73
    Nevertheless
    0.60
     Nonetheless
    0.54
     Nevertheless
    0.53
     anyway
    0.44
     toch
    0.42
    Anyway
    0.40
     Anyway
    0.38
    还是
    0.37
    Act Density 0.564%

    No Known Activations