INDEX
    Explanations

    words in multiple languages

    Negative Logits
    on      1.69
    :       1.66
    K       1.45
    ou      1.37
    un      1.34
    -       1.33
    U       1.30
    at      1.20
    EL      1.15
    ER      1.14
    Positive Logits
    кими    1.27
    ده      1.24
    ان      1.13
    ن       1.11
    ه       1.03
    به      1.00
    ді      0.95
    dan     0.94
    𝗱       0.93
    𝘨       0.93
    Activation Density: 0.020%

    No Known Activations