    Negative Logits
    ri      0.75
    n       0.70
    ling    0.68
    p       0.65
    k       0.63
    lim     0.61
    ro      0.61
    he      0.60
    roo     0.60
    s       0.59
    Positive Logits
    ס       0.89
    ו       0.81
    ج       0.73
    ּ       0.70
    ג       0.68
    ل       0.68
     불구하고   0.68
    ד       0.67
    ر       0.63
    𝙤       0.63
    Activation Density: 0.000%

    No Known Activations