    Negative Logits
        Token       Logit
        " are"      -0.74
        " who"      -0.56
        " charge"   -0.55
        " attempt"  -0.55
        " play"     -0.54
        " have"     -0.53
        " look"     -0.51
        " aren"     -0.51
        "are"       -0.50
        " were"     -0.50
    Positive Logits
        Token       Logit
        " is"        0.98
        " has"       0.96
        " wears"     0.89
        " owns"      0.87
        " develops"  0.87
        " happens"   0.84
        " buys"      0.82
        " sees"      0.81
        " obtains"   0.81
        " sleeps"    0.80
    Activation Density: 0.082%

    No Known Activations