    Negative Logits
    Logit   Token
    1.65
    1.52
    1.48
    1.43    graces
    1.42    úgy
    1.39
    1.38    joys
    1.38    underlies
    1.38    𝘿
    1.36    wedges
    Positive Logits
    Logit   Token
    2.33    1
    2.13    able
    2.09    ри
    2.05    5
    2.03    ά
    1.99    up
    1.99    7
    1.97    9
    1.97    6
    1.93
    Act Density 0.001%

    No Known Activations