INDEX
    Explanations
    NEGATIVE LOGITS
    Token   Logit
    ה       1.33
    ing     1.12
    Ф       1.12
    ه       1.09
    Т       1.00
    el      0.94
    I       0.94
            0.93
    И       0.89
            0.89
    POSITIVE LOGITS
    Token   Logit
    c       1.73
    j       1.65
            1.59
    ’)      1.52
    w       1.41
    r       1.40
    v       1.35
    k       1.34
    ва      1.27
    )’      1.23
    Act Density 0.000%

    No Known Activations