    Negative Logits
    Token            Logit
    " one"           -1.50
    " which"         -1.43
    " for"           -1.18
    " both"          -1.11
    " at"            -1.03
    " while"         -1.02
    " a"             -0.99
    " melawan"       -0.94
    " if"            -0.93
    " other"         -0.93
    Positive Logits
    Token            Logit
    " wszystko"      1.09
    " eftersom"      0.99
    "wußt"           0.97
    " joining"       0.93
    " ponieważ"      0.93
    "顺着"           0.92
    " undertaking"   0.92
    "jot"            0.92
    " bén"           0.92
    "tillation"      0.91
    Activation Density: 0.043%

    No Known Activations