INDEX
    Explanations
    Negative Logits
    -0.08
     UPS
    -0.07
     lez
    -0.07
     Montes
    -0.07
    	label
    -0.07
     ak
    -0.07
     humanity
    -0.07
     Bake
    -0.07
    -0.07
     Burke
    -0.07
    Positive Logits
    ativo
    0.08
    леж
    0.08
    ipt
    0.08
    orientation
    0.08
     influ
    0.07
    Decoder
    0.07
     contemplation
    0.07
     '~/
    0.07
    izon
    0.07
    ativos
    0.07
    Activation Density: 0.001%

    No Known Activations