    NEGATIVE LOGITS
    s             -1.26
    IS            -1.02
     тех          -0.95
    çoivent       -0.89
     完了          -0.88
    AT            -0.88
    !’            -0.87
     красивый     -0.86
    として          -0.86
    ?"            -0.86
    POSITIVE LOGITS
     Twitter      1.37
     twitter      1.30
    twitter       1.20
    🐦            1.11
     tweeting     1.08
     Tweet        1.07
    sphere        1.05
     Tweets       1.05
     Twe          1.01
     twit         1.01
    Activation Density: 0.008%

    No Known Activations