    Negative Logits
     again    -0.07
    EE        -0.07
    _GUI      -0.07
     عاشق     -0.07
    CHE       -0.07
     dare     -0.06
    (e        -0.06
    .Dial     -0.06
    .ham      -0.06
    aine      -0.06
    Positive Logits
     reduction     0.14
     Reduction     0.13
     reductions    0.12
     katkı         0.09
     redesign      0.08
     Radical       0.07
     đứ            0.07
    ्मक            0.07
    layers         0.07
    ZZ             0.07
    Activation Density: 0.007%

    No Known Activations