INDEX
    Explanations

    various forms of the word "this" and related demonstrative terms in different contexts

    Negative Logits
     scor      -0.16
    bah        -0.16
    angelo     -0.15
    agraph     -0.15
    angered    -0.14
    inge       -0.14
    身         -0.14
    vas        -0.14
    icone      -0.14
    igar       -0.14
    Positive Logits
    ifar       0.17
    ekil       0.16
    UPLE       0.16
    endas      0.15
    auer       0.15
    žÃŃ        0.15
    GLISH      0.15
     deflate   0.14
     kun       0.14
    olem       0.14
    Act Density 0.001%

    No Known Activations