    Negative Logits
    Token             Logit
    "/TR"             -0.07
    " arrest"         -0.07
    " Cul"            -0.06
    " artist"         -0.06
    " emission"       -0.06
    " hall"           -0.06
    " ה"              -0.06
    " Harlem"         -0.06
    " accelerated"    -0.06
    " therapists"     -0.06
    Positive Logits
    Token             Logit
    " sponge"         0.18
    " Sponge"         0.13
    "ponge"           0.11
    ".sponge"         0.08
    "Spy"             0.07
    "ğiz"             0.07
    " Сп"             0.07
    "πο"              0.07
    "ніп"             0.07
    " 구성"           0.07
    Activation Density: 0.001%

    No Known Activations