INDEX
    Explanations
    NEGATIVE LOGITS
    ch      0.82
    д       0.74
    z       0.73
    ла      0.69
    era     0.65
    ро      0.65
    op      0.64
    да      0.64
    л       0.64
    و       0.64
    POSITIVE LOGITS
    0       0.94
    t       0.88
     to     0.71
    );      0.64
            0.63
    tedir   0.61
     o      0.59
    تع      0.59
    𝟬       0.58
    𝟎       0.58
    Act Density 0.009%

    No Known Activations