    Negative Logits
    Token                   Value
    s                       1.40
    t                       1.20
    ne                      1.04
    на                      0.82
    h                       0.81
    v                       0.81
    y                       0.80
    g                       0.79
    kost                    0.77
    (token not rendered)    0.74
    Positive Logits
    Token                   Value
    '                       1.40
    ל                       1.30
    '/>                     1.09
    ם                       1.07
     can                    0.97
    ר                       0.93
    '";                     0.89
    (token not rendered)    0.84
    (token not rendered)    0.82
     It                     0.81
    Activation Density: 0.000%

    No Known Activations