INDEX
    NEGATIVE LOGITS
    Token         Logit
    -CS           -0.07
    شهر           -0.07
     smartphone   -0.07
    #:            -0.06
    emoth         -0.06
     almost       -0.06
    Marshal       -0.06
     ahora        -0.06
    mares         -0.06
     feliz        -0.06
    POSITIVE LOGITS
    Token         Logit
    лор           0.08
    istic         0.06
    Bar           0.06
    IRM           0.06
    Initial       0.06
    Γ             0.06
    isé           0.06
    _um           0.06
    β             0.06
    plane         0.06
    Act Density 0.006%

    No Known Activations