    Negative Logits
    Token           Logit
    "ي"             1.83
    "يا"            1.40
    "يج"            1.40
    "يش"            1.35
    "وم"            1.34
    "ه"             1.29
    "其他"          1.27
    "ع"             1.21
    "ה"             1.21
    "علي"           1.18

    Positive Logits
    Token           Logit
    "are"           1.05
    "apping"        1.03
    " as"           1.02
    "me"            0.97
    "mn"            0.97
    "ure"           0.96
    " gay"          0.95
    "."             0.94
    "рт"            0.93
    " homosexual"   0.93

    Activation density: 0.001%

    No known activations.