    Negative Logits
    " bk"               -0.06
    "ilitary"           -0.06
    " details"          -0.06
    " Elephant"         -0.06
    " decomposition"    -0.06
    ")+"                -0.06
    " who"              -0.06
    " Kho"              -0.06
    " nothing"          -0.06
    " maiden"           -0.06

    Positive Logits
    "igue"               0.07
    (token not shown)    0.07
    " Amerika"           0.06
    " Sociology"         0.06
    (token not shown)    0.06
    "_AM"                0.06
    " illum"             0.06
    " Gerr"              0.06
    " scoped"            0.06
    "yles"               0.06
    Activation Density: 0.025%

    No Known Activations