INDEX
    Explanations

    simple vs complex outcome

    Negative Logits
    " частично"  0.42
    "الله"  0.40
    " हद"  0.39
    " nejen"  0.38
    "毫无"  0.37
    "不仅"  0.37
    "ahoo"  0.36
    "无疑"  0.36
    0.36
    "OGRAM"  0.36
    Positive Logits
    " pourtant"  0.91
    " nevertheless"  0.88
    " nonetheless"  0.79
    " trotzdem"  0.79
    " disproportion"  0.78
    " dennoch"  0.76
    " impactful"  0.69
    " Nonetheless"  0.68
    "Nevertheless"  0.66
    " Nevertheless"  0.65
    Activation Density: 0.019%

    No Known Activations