    NEGATIVE LOGITS
    "6"      0.68
    "5"      0.62
    "7"      0.53
    "9"      0.52
    "ל"      0.50
    "ک"      0.47
    "8"      0.46
    "3"      0.44
    "تي"     0.43
    "יש"     0.43
    POSITIVE LOGITS
    "o"                0.48
    "色的"             0.40
    "ot"               0.40
    " siang"           0.39
    (token missing)    0.39
    "𝙤"                0.38
    " σαν"             0.37
    "ο"                0.37
    "oit"              0.35
    (token missing)    0.35
    Activation density: 1.302%

    No Known Activations