    Explanations

    references to self-awareness and self-identity

    Negative Logits
    Token                Logit
    "UnusedPrivate"      -0.65
    " Treue"             -0.61
    "XmlAccessorType"    -0.61
    " relâche"           -0.61
    " tenisky"           -0.60
    "setupUi"            -0.60
    " hänen"             -0.58
    " montagnes"         -0.57
    "Tikang"             -0.57
    ""]/"                -0.56
    Positive Logits
    Token                Logit
    " self"              2.02
    "self"               1.93
    " Self"              1.83
    "Self"               1.79
    " SELF"              1.70
    "SELF"               1.67
    "selves"             1.52
    " selves"            1.49
    " yourself"          1.37
    " Yourself"          1.34
    Act Density 0.237%

    No Known Activations
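
The activation density figure above can be read as the share of token positions on which the feature fires. The page does not define the metric, so the sketch below assumes it means the percentage of positions with an activation above zero. The function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and used only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of token positions where the feature is active.

    Assumption: a position counts as active when its activation is strictly
    greater than `threshold`. The source page does not state its definition.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical data: 10,000 token positions, 24 of which activate.
sample = [0.0] * 9976 + [1.5] * 24
density = activation_density(sample)
print(f"Act Density {density:.3f}%")
```

Under this reading, a density of 0.237% would correspond to roughly 2 to 3 active positions per 1,000 tokens.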