INDEX
    Explanations

    references to the concept of 'self' or 'identity.'

    Negative Logits
    ville         -0.16
    çĦ            -0.15
    ogn           -0.15
     Heaven       -0.14
     Tru          -0.14
    shire         -0.14
     convenience  -0.13
    uty           -0.13
    "profile      -0.13
    ウス          -0.13
    Positive Logits
     itself       0.29
    esen          0.18
    เอง           0.15
    elves         0.15
    IntArray      0.14
     đài          0.14
    606           0.14
     же           0.14
     Gran         0.14
    декс          0.14
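    Logit lists like the two above are commonly produced by projecting a feature's decoder direction onto the model's unembedding matrix and reading off the most positive and most negative entries. The sketch below assumes exactly that (a direct projection, ignoring the final LayerNorm); `top_logit_tokens`, `decoder_dir`, `W_U` and `vocab` are hypothetical names, and the toy inputs are made up for illustration, not taken from this feature.

```python
import numpy as np


def top_logit_tokens(decoder_dir, W_U, vocab, k=10):
    """Rank vocabulary tokens by a feature's direct effect on the output logits.

    Assumption: the effect is decoder_dir @ W_U, with no final LayerNorm.
    decoder_dir: (d_model,) decoder direction of the feature (hypothetical input)
    W_U:         (d_model, d_vocab) unembedding matrix
    vocab:       list of d_vocab token strings
    """
    logits = decoder_dir @ W_U          # (d_vocab,) logit contribution per token
    order = np.argsort(logits)          # token indices, most negative first
    negative = [(vocab[i], float(logits[i])) for i in order[:k]]
    positive = [(vocab[i], float(logits[i])) for i in order[::-1][:k]]
    return negative, positive


# Toy example: d_model = 2, a four-token vocabulary.
decoder_dir = np.array([1.0, 0.0])
W_U = np.array([[0.29, -0.16, 0.05, 0.18],
                [0.50, 0.50, 0.50, 0.50]])
vocab = ["a", "b", "c", "d"]
neg, pos = top_logit_tokens(decoder_dir, W_U, vocab, k=2)
print("negative:", neg)
print("positive:", pos)
```

    Sorting once with `argsort` and slicing from both ends yields both lists from a single pass over the vocabulary.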
    Activation Density 0.045%
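    Activation density is usually read as the fraction of evaluated tokens on which the feature's activation is above zero. A minimal sketch under that assumption; `activation_density` and its `threshold` parameter are hypothetical, and the activations below are synthetic.

```python
import numpy as np


def activation_density(acts, threshold=0.0):
    """Fraction of tokens whose feature activation exceeds `threshold`.

    Assumption: density = (# tokens with activation > threshold) / (# tokens).
    """
    acts = np.asarray(acts, dtype=float)
    return float(np.mean(acts > threshold))


# Synthetic activations: the feature fires on 5 of 10,000 tokens.
acts = np.zeros(10_000)
acts[:5] = 2.0
density = activation_density(acts)
print(f"{density:.3%}")  # prints 0.050%
```

    Read the same way, a density of 0.045% means the feature fires on roughly 1 token in 2,200 (1 / 0.00045 ≈ 2,222).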

    No Known Activations