INDEX
    Explanations

    phrases with the structure "self-[word]"

    phrases related to self-identity or self-awareness

    Negative Logits
    ulhu       -1.04
     "$:/      -0.83
     Hutch     -0.74
     AX        -0.70
     Chains    -0.69
     Rouge     -0.69
     Basin     -0.68
     Shaw      -0.68
     Starr     -0.67
    ÙIJ        -0.67
    Positive Logits
    imposed     1.14
    proclaimed  1.07
    esteem      1.06
    talk        1.05
    conscious   1.04
    destruct    1.04
    contained   1.00
    decl        0.98
    generated   0.98
    described   0.96
    Activation Density 0.048%

    No Known Activations
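
One way to read the two explanations against the positive logits is to prefix each promoted token with "self-" and check that the result is a familiar compound. The sketch below does only that. It assumes the tokens attach directly after the hyphen, and the function name `complete_self_phrases` is hypothetical, not part of any dashboard API. Note that "decl" is a sub-word piece, so its completion is left as the fragment "self-decl".

```python
# Top positive-logit tokens and values, copied from the list above.
POSITIVE_LOGITS = {
    "imposed": 1.14,
    "proclaimed": 1.07,
    "esteem": 1.06,
    "talk": 1.05,
    "conscious": 1.04,
    "destruct": 1.04,
    "contained": 1.00,
    "decl": 0.98,
    "generated": 0.98,
    "described": 0.96,
}


def complete_self_phrases(tokens, prefix="self-"):
    """Prefix each token with `prefix` (hypothetical helper, for illustration)."""
    return [prefix + token for token in tokens]


# Dicts preserve insertion order, so phrases stay sorted by logit value.
phrases = complete_self_phrases(POSITIVE_LOGITS)

if __name__ == "__main__":
    for phrase, logit in zip(phrases, POSITIVE_LOGITS.values()):
        print(f"{phrase:<16} {logit:.2f}")
```

Every promoted token forms a "self-[word]" compound or the start of one, which is consistent with the first explanation listed above.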