    Negative Logits
        " nave": -0.08
        " 가진": -0.08
        " نزد": -0.07
        " quen": -0.07
        " зад": -0.07
        " kategor": -0.07
        " چیزی": -0.07
        " Nak": -0.07
        " lign": -0.06
        "esub": -0.06
    Positive Logits
        "types": 0.07
        " Caesar": 0.06
        " Strait": 0.06
        "sorting": 0.06
        " Warp": 0.06
        " Sanity": 0.06
        "script": 0.06
        "/result": 0.06
        (unrendered token): 0.06
        "oops": 0.06
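The two lists above rank vocabulary tokens by how strongly this feature pushes their logits down (negative) or up (positive). One common way to produce such lists, assumed here rather than confirmed by this page, is to take the dot product of the feature's decoder direction with each token's unembedding column and keep the extremes. The sketch below does that in plain Python; `decoder_direction`, `unembed_columns`, and the toy two-dimensional vectors are hypothetical stand-ins, not real model weights.

```python
# Hypothetical sketch: rank tokens by one feature's direct logit effect.
# Assumption: effect[token] = dot(decoder_direction, unembedding column of token).
# All names and the toy vectors below are illustrative, not real weights.
from typing import Dict, List, Sequence, Tuple


def logit_effects(decoder_direction: Sequence[float],
                  unembed_columns: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Dot product of the decoder direction with each token's unembedding column."""
    return {
        token: sum(d * u for d, u in zip(decoder_direction, column))
        for token, column in unembed_columns.items()
    }


def top_and_bottom(effects: Dict[str, float],
                   k: int) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k highest and the k lowest (token, effect) pairs."""
    ranked = sorted(effects.items(), key=lambda item: item[1])
    lowest = ranked[:k]          # most negative effect first
    highest = ranked[::-1][:k]   # most positive effect first
    return highest, lowest


if __name__ == "__main__":
    direction = [0.5, -0.25]
    columns = {
        " Caesar": [0.1, -0.04],
        " nave": [-0.1, 0.12],
        "types": [0.12, -0.04],
        "oops": [0.0, 0.0],
    }
    positive, negative = top_and_bottom(logit_effects(direction, columns), 2)
    print("positive:", positive)
    print("negative:", negative)
```

With a real model, the unembedding columns would come from the model's output matrix and `k` would be around 10, matching the length of the lists shown here.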
    Activation Density: 0.007%
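Activation density is the share of sampled tokens on which the feature is active; 0.007% works out to roughly 7 firing tokens per 100,000. The sketch below assumes "active" means an activation strictly above 0.0; the page does not state its threshold or sample size, so both the `threshold` parameter and the synthetic sample are illustrative.

```python
# Hypothetical sketch: activation density as the percentage of tokens where a feature fires.
# Assumption: a token counts as firing when its activation exceeds `threshold` (0.0 by default).
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Percentage of tokens whose activation exceeds `threshold`."""
    if not activations:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > threshold)
    return 100.0 * firing / len(activations)


if __name__ == "__main__":
    # Synthetic sample: 7 firing tokens out of 100,000.
    sample = [0.0] * 100_000
    for i in range(7):
        sample[i * 1000] = 1.5
    print(f"Activation Density: {activation_density(sample):.3f}%")
```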

    No Known Activations