    Explanations

    words related to the concept of 'self'

    instances of the word "self" in different contexts

    Negative Logits
    " Flags"       -0.81
    " Rabbit"      -0.72
    " Powers"      -0.72
    " Decay"       -0.68
    " Pose"        -0.68
    "Shot"         -0.66
    " Crus"        -0.64
    " Paradise"    -0.63
    " Canary"      -0.63
    " Barrier"     -0.62
    Positive Logits
    "actory"       1.09
    "onso"         1.01
    "rint"         0.99
    "lf"           0.97
    "ibrary"       0.96
    "enn"          0.92
    " bour"        0.91
    "poons"        0.89
    "andom"        0.89
    "ood"          0.89
    Activation Density: 0.007%

    No Known Activations