    NEGATIVE LOGITS
    hest      -0.08
    مان      -0.08
    mont      -0.08
    sx      -0.08
    orsk      -0.07
    رد      -0.07
    -model      -0.07
    OTOR      -0.07
     erst      -0.07
    Bern      -0.07
    POSITIVE LOGITS
     Sail      0.06
     Verify      0.06
     Exactly      0.06
     ومع      0.06
     Wrath      0.06
     Cou      0.06
    .blank      0.06
     confess      0.06
    _sg      0.06
    をか      0.06
    Activation Density: 0.030%

    No Known Activations