INDEX
    Explanations
    Negative Logits
        Token             Logit
        " success"        -0.81
        " consistent"     -0.68
        "consistent"      -0.67
        " inconsistent"   -0.63
        " couch"          -0.59
        " Muses"          -0.57
        "-------"         -0.57
        " nahilalakip"    -0.56
        "theless"         -0.55
        "ized"            -0.54
    Positive Logits
        Token             Logit
        " liga"           0.55
        "sent"            0.55
        "iest"            0.55
        "scar"            0.55
        "sharing"         0.54
        "י"               0.54
        "UAWEI"           0.53
        "er"              0.52
        "shows"           0.52
        "save"            0.52
    Activation Density: 0.176%

    No Known Activations