INDEX
    Explanations
    Negative Logits
    Token           Logit
    ный             1.37
                    1.34
    𝘵               1.30
    ных             1.28
    𝘥               1.23
    τε              1.22
                    1.20
    tedir           1.20
     wykorzyst      1.18
                    1.18
    Positive Logits
    Token           Logit
    er              1.39
    i               1.36
     stesse         1.16
    ir              1.09
    ro              1.05
    ر               1.01
     statistique    0.95
    ли              0.92
     locom          0.92
    il              0.91
    Activation Density: 0.001%

    No Known Activations