INDEX
    Explanations
    NEGATIVE LOGITS
    Token            Logit
    "orpion"         -0.07
    " آزاد"          -0.06
    "صات"            -0.06
    "Gay"            -0.06
    "阅读"           -0.06
    " mevcut"        -0.06
                     -0.06
    "/us"            -0.06
    "rawing"         -0.06
    ".purchase"      -0.06

    POSITIVE LOGITS
    Token            Logit
    " fidelity"      0.10
    "idelity"        0.07
    " anecdotes"     0.07
    " قدر"           0.07
    " consequence"   0.06
    " mismatch"      0.06
    " Imper"         0.06
    " Pro"           0.06
    " Mi"            0.06
    " dří"           0.06

    Act Density 0.001%

    No Known Activations