    NEGATIVE LOGITS
    Token                 Value
    '                     1.31
    م                     1.26
    س                     1.13
    (no visible token)    1.05
    (no visible token)    1.02
    و                     0.99
    ة                     0.97
    (no visible token)    0.95
    ри                    0.95
    ného                  0.91
    POSITIVE LOGITS
    Token                 Value
    the                   1.41
    w                     1.36
    x                     1.24
    in                    1.15
    a                     1.15
    ent                   1.09
    an                    1.05
    n                     1.04
    entertainment         1.00
    with                  0.99
    Activation Density: 0.003%

    No Known Activations