    Negative Logits
    Token          Logit
    " Nagoya"      -0.96
    "нили"         -0.96
                   -0.91
    "Badge"        -0.91
    " nerfs"       -0.89
    "pezif"        -0.88
    "пами"         -0.86
    " ROK"         -0.86
    " Yus"         -0.85
    "ăr"           -0.85

    Positive Logits
    Token          Logit
    "jemahan"      0.95
    " for"         0.88
    " avons"       0.88
    "atividade"    0.86
    " After"       0.85
    " الأولى"      0.83
    "atra"         0.82
    "ڏ"            0.82
    " ilha"        0.82
    " אנו"         0.81
    Activation Density: 0.001%

    No Known Activations