    Negative Logits
    Logit  Token
    1.60  an
    1.59  ו
    1.58  a
    1.43  u
    1.39  the
    1.36  (token not shown)
    1.36  (token not shown)
    1.34  و
    1.25  (token not shown)
    1.21  м
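    Positive Logits
    Logit  Token
    1.26  (token not shown)
    1.20  (token not shown)
    1.16   in
    1.08  (token not shown)
    1.07  ات
    1.00  ,”
    0.98  (token not shown)
    0.97  '
    0.96  (token not shown)
    0.95  "</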
    Act Density 0.000%

    No Known Activations