INDEX
    Explanations

    references to self-awareness and personal agency

    Negative Logits
    ana             -0.15
    aver            -0.15
    .invalidate     -0.15
     Sil            -0.15
     Moral          -0.14
    ãĤ·ãĥ¼          -0.14
     Hoff           -0.14
    asury           -0.14
    enders          -0.14
    612             -0.14
    Positive Logits
    页éĿ¢åŃĺæ¡£å¤ĩ份    0.19
    inel            0.15
    ÌĨ              0.14
    wij             0.14
    ảo              0.14
    iffies          0.14
    uger            0.14
    untime          0.14
    ëį°             0.14
     tục            0.14
    Act Density 0.838%

    No Known Activations