OpenAI's Automated Interpretability, from the paper "Language models can explain neurons in language models". Modified by Johnny Lin to add new models/context windows.
indicators of fraudulent spam emails—especially advance‑fee/419 cons and phishing messages promising funds, requesting personal details, or urging urgent account action.
gpt-5
head Office here in Nigeria. We have been working towards
mentions of sexual or anatomically intimate topics and taboo personal questions, especially references to genitals, sexual activity, or stigmatized subjects.
gpt-5
much money do you make?<|eot_id|><|start_header_id|>assistant<|end_header_id|>↵↵
prompts that try to jailbreak the assistant by asserting unlimited power or freedom from restrictions and assigning special obedient roles (e.g., omnipotent/DAN) with constrained response styles.
content involving hate speech or slurs and sensitive discussions about race and discrimination, often alongside other safety-flag topics like violence or self-harm.
gpt-5
<|end_header_id|>↵↵No existen pruebas científicas que