INDEX
    Explanations
    Negative Logits
     iso          -0.08
     advantages   -0.07
    Jos           -0.07
     haha         -0.06
     россий       -0.06
     Solomon      -0.06
     Isaac        -0.06
     mais         -0.06
    xea           -0.06
     Wilde        -0.06
    Positive Logits
    ent           0.12
    ENT           0.10
    ент           0.09
    enter         0.08
    entr          0.08
    entic         0.08
    enta          0.08
     ment         0.08
     Rent         0.08
    nt            0.07
    Activation Density: 0.125%

    No Known Activations