INDEX
    Explanations
    NEGATIVE LOGITS
    Token         Logit
    "(match"      -0.07
    "及其"        -0.07
    " bk"         -0.06
    " vacuum"     -0.06
    "anything"    -0.06
    "\tMatrix"    -0.06
    "FTER"        -0.06
    "/no"         -0.06
    "EMPTY"       -0.06
    "یکی"         -0.06
    POSITIVE LOGITS
    Token         Logit
    "μπο"         0.07
    "-Allow"      0.06
    "()}↵"        0.06
    " استرات"     0.06
    "loses"       0.06
    "ainted"      0.06
    "(chart"      0.06
                  0.06
    "_BUFF"       0.06
    " frontal"    0.06
    Activation Density: 0.012%

    No Known Activations