    Negative Logits
    Token          Logit
    "[layer"       -0.08
    "(layer"       -0.08
    " baseline"    -0.08
    "Baseline"     -0.08
    "Layer"        -0.08
    "Bas"          -0.08
    "layer"        -0.07
    " tier"        -0.07
    " partying"    -0.07
    "idget"        -0.07
    Positive Logits
    Token             Logit
    " Kov"            0.09
    " разруш"         0.09
    (token missing)   0.09
    "_WINDOWS"        0.08
    " Fedora"         0.08
    " Push"           0.08
    " scap"           0.08
    " pux"            0.08
    "\tpush"          0.08
    " PUSH"           0.08
    Activation Density: 0.003%

    No Known Activations