    Negative Logits
      " insult"        -0.07
      " allegiance"    -0.06
      "ома"            -0.06
      ",get"           -0.06
      ".recipe"        -0.06
      "_layer"         -0.06
      " Fou"           -0.06
      " transferred"   -0.06
      " Huss"          -0.06
      " preferences"   -0.06
    Positive Logits
      ":^("             0.08
      "CONN"            0.07
      " MOST"           0.06
      "(td"             0.06
      " Epoch"          0.06
      " 작업"           0.06
      " AW"             0.06
      "ep"              0.06
      " PURE"           0.06
      "(edge"           0.06
    Activation Density: 0.002%

    No Known Activations