    Negative Logits
    Token            Logit
    " великолеп"     -1.95
    "😻"             -1.67
    "💘"             -1.63
    (blank)          -1.62
    "🎊"             -1.59
    "💓"             -1.53
    "来讲"           -1.53
    " iaitu"         -1.52
    (blank)          -1.52
    " triunfo"       -1.47
    Positive Logits
    Token            Logit
    " didn"          1.71
    "</h2>"          1.66
    " their"         1.62
    " isn"           1.51
    " already"       1.48
    " это"           1.47
    " they"          1.46
    " дефек"         1.46
    " that"          1.44
    " tiene"         1.43
    Activation Density: 0.083%

    No Known Activations