INDEX
    Explanations

    words that indicate approval or acceptance of actions and situations

    Negative Logits
    " we"        -0.19
    "we"         -0.17
    "mtx"        -0.15
    " deren"     -0.15
    "其"         -0.15
    "ewe"        -0.14
    "/maps"      -0.14
    " myself"    -0.14
    " &"         -0.14
    "We"         -0.14
    Positive Logits
    " our"       0.40
    "our"        0.36
    "我们的"     0.33
    " OUR"       0.32
    " Our"       0.28
    "这样的"     0.28
    " nosso"     0.28
    "OUR"        0.28
    " наших"     0.28
    " such"      0.28
    Act Density 0.006%

    No Known Activations