INDEX
    Negative Logits
    -0.10  "<Element"
    -0.08  "<Int"
    -0.08  " Diz"
    -0.08  "یه"
    -0.08  " Yog"
    -0.07  "ibur"
    -0.07  "amodel"
    -0.07  " Hes"
    -0.07  " simmer"
    -0.07  "hunt"
    Positive Logits
     0.09
     0.09  " alike"
     0.08  "伦理"
     0.08  " envis"
     0.08  " etiquette"
     0.08
     0.08  " evolution"
     0.08  "友情"
     0.08  " objeto"
     0.08  " ethn"
    Act Density 0.013%

    No Known Activations