    Explanations

    actions related to understanding, discovering, or manipulating concepts

    Negative Logits
    Token        Logit
    "its"        -0.17
    " Its"       -0.17
    "Its"        -0.17
    "каз"        -0.14
    " Rim"       -0.14
    "appa"       -0.14
    " Lar"       -0.14
    "opleft"     -0.13
    "utable"     -0.13
    "GS"         -0.13
    Positive Logits
    Token           Logit
    " things"       0.30
    " everything"   0.28
    " stuff"        0.21
    " Things"       0.20
    " thing"        0.20
    "everything"    0.20
    "things"        0.19
    " Everything"   0.19
    "一切"          0.19
    " alles"        0.18
    Act Density 0.156%

    No Known Activations