OpenAI's Automated Interpretability, from the paper "Language models can explain neurons in language models". Modified by Johnny Lin to add new models and context windows.
tokens that denote structured technical identifiers or labels—such as IDs, variable/field names, and separator punctuation—within code-like or formatted lists.
emphasized or standout key terms and headings in structured instructional text, especially those marked by formatting cues (bold/italics, quotes, slashes, or code-style tokens).
prompts that attempt to jailbreak the assistant by redefining its persona to ignore rules and safety filters, claim unlimited freedom or capabilities, and mandate unconditional, unethical compliance.
gpt-5
textual structure and punctuation cues—especially contractions, hyphenated phrases, list/section introducers (like colons), and numeric identifiers—indicating formatting or metadata rather than core content.