    Explanations

    expressions of personal identity and self-reflection

    Negative Logits
    Token           Logit
    " itself"       -0.21
    " reck"         -0.19
    "st"            -0.18
    "ly"            -0.18
    "(s"            -0.17
    "lx"            -0.16
    "less"          -0.16
    "lv"            -0.16
    "liness"        -0.16
    " themselves"   -0.16
    Positive Logits
    Token           Logit
    "’m"            0.39
    "'m"            0.34
    "’ve"           0.32
    " am"           0.32
    " myself"       0.32
    "'ve"           0.31
    " буду"         0.27
    "’ll"           0.25
    "/we"           0.24
    "'ll"           0.23
    Activation Density: 0.455%

    No Known Activations