    Explanations

    phrases indicating self-involvement or self-referential actions

    phrases suggesting actions of self-identification or self-reference

    Negative Logits
    gap         -0.70
    heny        -0.70
     Feature    -0.67
    grade       -0.65
    lav         -0.63
    jug         -0.59
    ahan        -0.59
    illary      -0.59
     IPS        -0.59
    culosis     -0.57
    Positive Logits
     pant       0.69
    ortium      0.64
     hunted     0.64
    æµ          0.62
    isner       0.62
     ashamed    0.60
    ãģĹ         0.60
     disgust    0.60
    peror       0.60
     sanct      0.60
    Activation Density: 0.192%

    No Known Activations