    Negative Logits
    Token            Logit
    " blessings"     -0.08
    "NA"             -0.08
    " директор"      -0.08
    "не"             -0.07
    " hola"          -0.07
    " করি"           -0.07
    " Quito"         -0.07
    " prachtige"     -0.07
    "Sala"           -0.07
    " пищ"           -0.07
    Positive Logits
    Token            Logit
    " criticizing"   0.08
    " boy"           0.07
    "userid"         0.07
    "סום"            0.07
    " maling"        0.07
    " Styling"       0.07
    " colspan"       0.07
    "show"           0.07
    "peg"            0.07
    " gradients"     0.07
    Activation Density: 0.003%

    No Known Activations