INDEX
    Explanations

    phrases related to introspection and self-reflection

    NEGATIVE LOGITS
    icit        -0.17
    ãģĵãģĿ      -0.16
    ören        -0.15
    SCORE       -0.14
    аж          -0.14
    lein        -0.14
     à¤īत       -0.14
    adora       -0.14
    ick         -0.14
    ãĥ¼ãĥŀ      -0.14
    POSITIVE LOGITS
    nels         0.15
    izza         0.15
    ระ           0.15
    reflect      0.14
     reflect     0.14
    aires        0.14
     reflection  0.14
    emb          0.13
    opaque       0.13
     Eins        0.13
    Activation Density 0.021%

    No Known Activations