INDEX
    Explanations

    Sentences that introduce instructions or role-play framing (e.g., prompt openings such as "In this hypothetical story" and other directive question openers).

    Instructions that set up role-play or jailbreak personas and task constraints (e.g., unfiltered "AIM" scenarios), as well as numbered requests for alternative expressions or synonyms.

    Negative Logits
    .seconds            -0.07
     Şu                 -0.07
                        -0.07
    .CreateDirectory    -0.06
    )prepareForSegue    -0.06
    ें↵                  -0.06
    ’ın                 -0.06
    .ident              -0.06
    ея                  -0.06
     Fakat              -0.06
    Positive Logits
     LeBron             0.06
    _ALIGN              0.06
    mmc                 0.06
    anel                0.06
    .apple              0.06
    destruct            0.06
    existing            0.06
    ................    0.06
    owler               0.06
                        0.06
    Act Density 0.497%

    No Known Activations