    NEGATIVE LOGITS
    Token       Logit
    " in"       0.75
    "ש"         0.59
    " an"       0.59
    " في"       0.57
    " в"        0.55
    " інших"    0.53
    " در"       0.50
    " với"      0.49
    " а"        0.48
    "па"        0.48
    POSITIVE LOGITS
    Token       Logit
    "is"        0.62
    "er"        0.58
    "el"        0.52
    "al"        0.50
    "erar"      0.49
    "ar"        0.46
    "("         0.46
    "il"        0.44
                0.42
    "ol"        0.42
    Activation density: 0.008%

    No Known Activations