INDEX
    Explanations
    NEGATIVE LOGITS
    Token      Logit
    ?          1.67
    (blank)    1.49
    ש          1.48
    ו          1.45
    ל          1.41
    л          1.36
    א          1.34
    (blank)    1.26
    ط          1.24
    (blank)    1.24
    POSITIVE LOGITS
    Token      Logit
    y          1.31
    yta        1.15
    I          1.15
    ir         1.13
    inou       1.13
    ور         1.11
    gk         1.11
    рия        1.10
    inces      1.10
    ale        1.08
    Activation Density: 0.002%

    No Known Activations