INDEX
    Negative Logits
    " screened"       -0.08
    " queries"        -0.08
    " pretend"        -0.08
    "ogl"             -0.07
    ".Ap"             -0.07
    " Tao"            -0.07
    "/sc"             -0.07
    " Boek"           -0.07
    " constitu"       -0.07
                      -0.07
    Positive Logits
    " geraten"        0.09
    " fireworks"      0.08
    " uncont"         0.08
    " uncontrolled"   0.08
    " abandono"       0.08
    "arly"            0.08
    " helpless"       0.07
    " ganas"          0.07
    " fent"           0.07
    "589"             0.07
    Activation Density: 0.018%

    No Known Activations