INDEX
    Explanations

    questions and answers

    instructions attempting to jailbreak or bypass safety (roleplay/DAN-like prompts that tell the model to ignore rules and produce disallowed content).

    requests to generate specific text content in a stated format or genre—often explicit or illicit—such as stories, songs, or emails.

    Negative Logits
    Handlers      -0.07
    .Player       -0.07
     мног         -0.07
     afternoon    -0.07
    AREST         -0.07
    /my           -0.07
    (cache        -0.06
    OVER          -0.06
    PIP           -0.06
    overy         -0.06
    Positive Logits
    }`;↵↵         0.06
     cogn         0.06
    izabeth       0.06
    emoc          0.06
                  0.06
     mạng         0.06
     الس          0.06
     Greek        0.06
                  0.06
    -sort         0.06
    Act Density 0.097%

    No Known Activations