INDEX
    Explanations

    Uncensored model benchmarking

    Negative Logits
    Token            Logit
    " add"           -0.08
    " plus"          -0.07
    (not shown)      -0.07
    " elaborate"     -0.07
    " integrate"     -0.07
    " Jeremy"        -0.07
    (not shown)      -0.07
    " faveur"        -0.07
    (not shown)      -0.07
    "іра"            -0.07
    Positive Logits
    Token            Logit
    "-content"       0.09
    "437"            0.09
    " Wrangler"      0.09
    " Content"       0.08
    " IMO"           0.08
    "-dro"           0.08
    "\tContent"      0.08
    "-tests"         0.08
    "-chat"          0.08
    "117"            0.08
    Activation Density: 0.001%

    No Known Activations