INDEX
    Explanations
    NEGATIVE LOGITS
     분류
    -0.07
    Parsing
    -0.06
    .put
    -0.06
     taste
    -0.06
    Rest
    -0.06
     punching
    -0.06
    	ax
    -0.06
    -0.06
     shuffled
    -0.06
    Lines
    -0.06
    POSITIVE LOGITS
     clk
    0.07
    нитель
    0.07
     burglary
    0.07
    Am
    0.06
            				
    0.06
    ris
    0.06
     dönem
    0.06
     juven
    0.06
    membership
    0.06
     hydration
    0.06
    Act Density 0.011%

    No Known Activations