INDEX
    Negative Logits
    ва    1.28
    هي    1.25
    ться    1.17
    ين    1.16
    هام    1.16
    אם    1.13
    ه    1.10
    או    1.08
    ول    1.07
    الى    1.07
    Positive Logits
    of    1.44
     as    1.32
    (no token shown)    1.30
    s    1.26
    to    1.16
    .    1.09
    </h2>    1.06
    (no token shown)    1.02
    </a>    1.02
    the    1.01
    Act Density 0.000%

    No Known Activations