INDEX
    Explanations

    direct instructions or actions

    Negative Logits
    Token        Value
    " is"        1.59
    "{"          1.29
    " as"        1.11
    " has"       1.11
    "."          1.10
    "د"          1.09
    " einem"     1.02
    "}"          0.99
    " zeigen"    0.98
    " ="         0.96
    Positive Logits
    Token        Value
    "the"        1.41
    "n"          1.33
    "is"         1.27
    "יות"        1.23
    "on"         1.20
    "ing"        1.13
    "as"         1.06
    "in"         1.04
    "it"         1.02
    "a"          1.01
    Activation Density: 0.086%

    No Known Activations