INDEX
    Explanations

    Random internet text snippets

    Instructions that attempt to jailbreak the model or force it into role-play (e.g., "DAN" / "Do Anything Now" / NAME_2 prompts) by asking it to ignore rules, make things up, or adopt unrestricted personas.

    NEGATIVE LOGITS
    Token           Logit
    "College"       -0.07
    "/change"       -0.06
    "-help"         -0.06
    "collapse"      -0.06
    "_az"           -0.06
    " analytic"     -0.06
    (missing)       -0.06
    "ousy"          -0.06
    " minds"        -0.06
    " Elvis"        -0.06
    POSITIVE LOGITS
    Token           Logit
    "xmin"          0.07
    " viable"       0.06
    " aggravated"   0.06
    "_ARGS"         0.06
    " sınır"        0.06
    "ejména"        0.06
    "[];↵"          0.06
    "리를"          0.06
    "iasm"          0.06
    "ched"          0.06
    Activation Density: 0.012%

    No Known Activations