INDEX
    Explanations
    Negative Logits
    Token              Logit
    " uniquely"        -0.07
    "('~"              -0.07
    " poison"          -0.06
    "056"              -0.06
    " stuff"           -0.06
    " реги"            -0.06
    " attent"          -0.06
    " yalnızca"        -0.06
    (no token shown)   -0.06
    "工具"             -0.06
    Positive Logits
    Token              Logit
    " Friends"         0.23
    "Friends"          0.17
    "riends"           0.08
    "friends"          0.08
    ".friends"         0.07
    " ödül"            0.07
    "urring"           0.06
    "parallel"         0.06
    " Friendship"      0.06
    ".met"             0.06
    Activation Density: 0.003%

    No Known Activations