    Explanations

    z at the start of words

    Negative Logits
    a      1.86
    i      1.66
    u      1.62
    in     1.48
    n      1.48
    A      1.44
    T      1.35
    ;      1.34
    ach    1.30
    S      1.23
    Positive Logits
    ير     1.59
    ב      1.51
    ם      1.50
    ер     1.44
    ли     1.41
    ма     1.39
    т      1.38
    у      1.38
    и      1.38
    (no visible token)  1.37
    Activation Density: 0.296%

    No Known Activations