    Negative Logits
     scratching
    -0.07
    -0.07
     σύν
    -0.07
     criter
    -0.07
     shattered
    -0.07
     isAdmin
    -0.06
     patiently
    -0.06
    итом
    -0.06
     Marian
    -0.06
     po
    -0.06
    Positive Logits
     unlike
    0.09
     Unlike
    0.08
    Unlike
    0.07
    으며
    0.06
     differently
    0.06
    YL
    0.06
    ουλ
    0.06
     Illegal
    0.06
    dır
    0.06
    .down
    0.06
    Activation Density: 0.007%

    No Known Activations