INDEX
    Explanations
    Negative Logits
    Token          Logit
    " xhr"         -0.07
    "umsuz"        -0.06
    " q"           -0.06
    " Voc"         -0.06
    " vinc"        -0.06
    " hurting"     -0.06
    " eden"        -0.06
    " nedenle"     -0.06
    " mockery"     -0.06
    "ridge"        -0.06
    Positive Logits
    Token          Logit
    " правиль"     0.06
    " حزب"         0.06
    "ellschaft"    0.06
    "خي"           0.06
    "ане"          0.06
    "بيع"          0.06
                   0.06
    " (笑"         0.06
    " السكان"      0.06
    "_WM"          0.06
    Activation Density: 0.006%

    No Known Activations