INDEX
    Explanations

    references to personal experiences and self-identity

    Negative Logits
    Token            Logit
    " itself"        -0.25
    "ness"           -0.21
    " themselves"    -0.20
    "ly"             -0.19
    "wers"           -0.18
    "ting"           -0.17
    "rette"          -0.16
    "appen"          -0.16
    "ship"           -0.16
    "nya"            -0.16
    Positive Logits
    Token            Logit
    "/us"            0.63
    "/her"           0.43
    " personally"    0.33
    "/my"            0.30
    "zelf"           0.28
    "-même"          0.28
    "adows"          0.27
    "adow"           0.27
    "SELF"           0.27
    "andering"       0.26
    Activation Density: 0.248%
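
    A minimal sketch of how a density figure like the one above could be computed, assuming it means the fraction of token positions on which the feature's activation is above zero. The function name, the threshold, and the toy data are hypothetical and are not taken from the dashboard itself.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" is the share of positions where the feature fires;
    the dashboard's exact definition is not stated in the source.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy data (hypothetical): 4000 token positions, 10 of them active.
toy_activations = [0.0] * 3990 + [1.5] * 10
density = activation_density(toy_activations)
print(f"{density:.3%}")
```

    Under this reading, a density of 0.248% would mean the feature fires on roughly one token position in four hundred.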

    No Known Activations