INDEX
    Explanations
NEGATIVE LOGITS
    Token (quoted)    Logit
    " instructed"     -0.07
    "ların"           -0.07
    "XM"              -0.07
    " Johnson"        -0.07
    "them"            -0.07
    "-action"         -0.07
    " Čer"            -0.07
    " hàm"            -0.06
    " SCM"            -0.06
    " ner"            -0.06
POSITIVE LOGITS
    Token (quoted)    Logit
    " poly"           0.19
    " Poly"           0.15
    "Poly"            0.15
    "poly"            0.14
    ".poly"           0.11
    "_poly"           0.11
    "(poly"           0.11
    "oly"             0.10
    " много"          0.09
    " ell"            0.08
    Activation Density: 0.009%

    No Known Activations