INDEX
    Explanations
    Negative Logits
    .HasPrefix    -0.07
    ştır          -0.07
    입니다           -0.07
    одерж         -0.06
    роп           -0.06
     الشي         -0.06
    _DIRECT       -0.06
     لب           -0.06
     raj          -0.06
    ость          -0.06
    Positive Logits
    Donate        0.07
    Cat           0.07
    hon           0.07
                  0.06
    _w            0.06
     baggage      0.06
     influence    0.06
     pornography  0.06
    Condition     0.06
     Control      0.06
    Act Density 0.018%

    No Known Activations