    Negative Logits
    Token             Logit
    " activates"      -0.09
    "ampler"          -0.08
    " pops"           -0.08
    "θεί"             -0.07
    " gymnastics"     -0.07
    " infused"        -0.07
    " hortic"         -0.07
    " chaud"          -0.07
    " wife"           -0.07
    " ctor"           -0.07
    Positive Logits
    Token             Logit
    "uchi"            0.08
    "avo"             0.08
    " minimalist"     0.08
    " з"              0.08
    " prohibit"       0.08
    " solamente"      0.07
    ""                0.07
    "otro"            0.07
    "—which"          0.07
    " (!)"            0.07
    Act Density 0.059%

    No Known Activations