    NEGATIVE LOGITS
    Token             Value
    "ل"               0.93
    (missing)         0.79
    "as"              0.72
    "ת"               0.69
    "l"               0.59
    "و"               0.57
    "on"              0.56
    "<unused1882>"    0.56
    " patitth"        0.54
    "hæng"            0.54
    POSITIVE LOGITS
    Token             Value
    "0"               0.79
    "\"               0.63
    "(),"             0.62
    " of"             0.59
    "ote"             0.58
    " to"             0.58
    " \"              0.55
    "<"               0.55
    " a"              0.54
    "ør"              0.53
    Activation Density: 0.019%

    No Known Activations