OpenAI's automated interpretability explanations, from the paper "Language models can explain neurons in language models". Modified by Johnny Lin to add new models/context windows.
The neuron activates on instructions that set up role‑play/jailbreak personas and task constraints (e.g., unfiltered “AIM” scenarios), as well as numbered requests for alternative expressions or synonyms.
This neuron detects instruction/request verbs (imperative task words like "create", "write", "design", "teach", "make") that signal a user asking the model to perform a task.
The neuron detects dismissive or minimizing language about mental-health problems that urges simplistic self-control instead of acknowledging real distress.