    Negative Logits
    Token           Logit
    " subclass"     -0.07
    "ولی"           -0.06
    "hey"           -0.06
    "_KEEP"         -0.06
    "(stop"         -0.06
    "ΡΓ"            -0.06
    "happy"         -0.06
    "(cls"          -0.06
    "WASHINGTON"    -0.06
    " Kabul"        -0.06
    Positive Logits
    Token                       Logit
    ".email"                    0.06
    "۲۰۲"                       0.06
    " образом"                  0.06
    " Andy"                     0.06
    "kn"                        0.06
    "؟↵"                        0.06
    " Kn"                       0.06
    " MW"                       0.06
    (whitespace token: tabs)    0.06
    " ще"                       0.06
    Activation Density: 0.025%

    No Known Activations