INDEX
    Explanations
    Negative Logits
    " mutated"    -0.07
    " prepar"     -0.06
    " cứ"         -0.06
    " honey"      -0.06
    " Phrase"     -0.06
    " سفید"       -0.06
    "Tour"        -0.06
    " тор"        -0.06
    " Ρ"          -0.06
    " Ment"       -0.06
    Positive Logits
    "방송"                           0.07
    " samt"                         0.06
    "exc"                           0.06
    " ForCanBeConvertedToForeach"   0.06
    "-win"                          0.06
    "اگ"                            0.06
    "urers"                         0.06
    "gew"                           0.06
    " ours"                         0.06
    "ाय"                            0.06
    Activation Density: 0.039%

    No Known Activations