INDEX
    Explanations
    Negative Logits
     thác  -0.07
    -0.07
    adores  -0.06
    dic  -0.06
    нее  -0.06
    طن  -0.06
     Stephens  -0.06
     начал  -0.06
    =max  -0.06
    cc  -0.06
    Positive Logits
    )(*  0.07
    ([])↵  0.07
     posters  0.07
     Crazy  0.06
     CAST  0.06
     polit  0.06
     identities  0.06
     ideologies  0.06
     kvinne  0.06
    0.06
    Act Density 0.001%

    No Known Activations