INDEX
    Explanations

    instances of self-referential or personal statements

    Negative Logits
    strup    -0.16
    lili     -0.14
     Roller  -0.14
    aries    -0.14
    uali     -0.13
    402      -0.13
    elial    -0.13
     force   -0.13
     Biz     -0.13
    386      -0.13
    Positive Logits
    etch     0.16
    ãģıãģł   0.15
    ataka    0.15
    主任     0.14
    eya      0.14
    ala      0.14
    ãĥĭãĤ¢   0.14
    ument    0.14
    odied    0.13
    ude      0.13
    Act Density 0.069%

    No Known Activations