INDEX
    NEGATIVE LOGITS
    ۔                    1.18
    [token not shown]    1.00
    [token not shown]    0.97
    [token not shown]    0.94
    [token not shown]    0.89
    [token not shown]    0.80
    of                   0.78
    ،                    0.76
    х                    0.75
    ی                    0.75
    POSITIVE LOGITS
    ية    1.00
    AN    0.98
    o     0.94
    ay    0.94
    ag    0.94
    IL    0.92
    ur    0.91
    AY    0.89
    il    0.88
    y     0.88
    Activation Density: 0.014%

    No Known Activations