INDEX
    Explanations

    phrases related to self-referential concepts and self-directed actions

    Negative Logits
    ラス
    -0.16
    uted
    -0.14
    uen
    -0.14
    _callbacks
    -0.14
    ayed
    -0.14
    orelease
    -0.13
    وار
    -0.13
    otton
    -0.13
    opa
    -0.13
    eming
    -0.13
    Positive Logits
    same
    0.25
    -contained
    0.24
    ishly
    0.23
    lessly
    0.22
    ridge
    0.22
    änd
    0.21
    Contained
    0.21
     explanatory
    0.21
    contained
    0.21
    hood
    0.19
    Act Density 0.024%

    No Known Activations