INDEX
    Explanations

    prompts encouraging critical thinking and reflection

    Negative Logits
    Token          Logit
    "arness"       -0.15
    "usercontent"  -0.14
    " Hacker"      -0.14
    "ãĥ¡ãĥ©"       -0.14
    "æĤł"          -0.14
    ".azure"       -0.14
    "ecycle"       -0.14
    "ucker"        -0.13
    "ibaba"        -0.13
    "aran"         -0.13
    Positive Logits
    Token          Logit
    " yourself"    0.17
    "tout"         0.16
    "åIJ§"          0.16
    "ance"         0.15
    "ables"        0.15
    "865"          0.14
    " Yourself"    0.14
    "ZA"           0.14
    "able"         0.14
    "778"          0.14
    Activation Density: 0.067%

    No Known Activations