    Explanations

    reflexive pronouns

    Negative Logits
    Token            Logit
    " itſelf"        -1.12
    "themselves"     -1.10
    "itself"         -1.07
    "himself"        -1.07
    " himself"       -1.03
    " Himself"       -0.98
    " themselves"    -0.98
    " themſelves"    -0.97
    " itself"        -0.97
    " himſelf"       -0.97

    Positive Logits
    Token            Logit
    " can"            0.52
    " understand"     0.46
    " but"            0.44
    " finally"        0.43
    " want"           0.43
    " have"           0.42
    " get"            0.42
    " if"             0.42
    " we"             0.41
    " know"           0.41
    Activation Density: 0.023%

    No Known Activations