INDEX
    Explanations

    placeholders and formatting

    Negative Logits
    a         1.77
    ه         1.63
    j         1.57
    u         1.29
              1.28
              1.22
    b         1.16
    ج         1.16
    l         1.13
    d         1.11
    Positive Logits
     as       1.48
    ;         1.33
    "         1.27
     innych   1.19
    できる    1.18
    :         1.17
    ות        1.16
    ını       1.13
              1.10
              1.09
    Activation Density: 0.002%

    No Known Activations