INDEX
    Negative Logits
    Token          Logit
    "有很多"        -0.84
    "💄"           -0.83
    "meu"          -0.82
    " поэтому"     -0.81
    "很多人"        -0.80
    " mening"      -0.80
    "👗"           -0.80
    "rhino"        -0.79
    "Hebrews"      -0.79
    " svoje"       -0.79
    Positive Logits
    Token          Logit
    " does"        1.45
    " did"         1.10
    " it"          1.04
    " DOES"        0.95
    "cedo"         0.93
    " Does"        0.88
    "leski"        0.81
    " !$"          0.80
    " sobr"        0.79
    "ضور"          0.78
    Activation Density: 0.071%

    No Known Activations