INDEX
    Explanations
    NEGATIVE LOGITS
     С          0.63
     their      0.57
                0.53
     onların    0.53
    MINE        0.52
    л           0.52
     По         0.52
     Та         0.52
     are        0.51
     an         0.50
    POSITIVE LOGITS
    un    0.78
    er    0.75
    ar    0.70
    im    0.70
    ok    0.66
    re    0.66
    ir    0.65
    ن     0.64
    .     0.62
    ed    0.61
    Act Density 0.004%

    No Known Activations