OpenAI's Automated Interpretability, from the paper "Language models can explain neurons in language models". Modified by Johnny Lin to add new models and context windows.
Uses the default prompts from the main branch, with the TokenActivationPair strategy.
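As a rough illustration of what a token/activation-pair prompt might look like, the sketch below pairs each token with a discretized activation value. It assumes that negative activations are clipped to zero, that values are scaled to integers from 0 to 10 relative to a maximum activation, and that each pair is written as one tab-separated line between `<start>` and `<end>` markers. The function names `normalize_activations` and `format_token_activation_pairs` are hypothetical and are not taken from the library itself.

```python
from typing import List, Sequence


def normalize_activations(
    activations: Sequence[float], max_activation: float, scale: int = 10
) -> List[int]:
    """Clip negatives to zero and map activations onto integers 0..scale.

    Assumption: the scale is relative to max_activation, the largest
    activation observed for this neuron.
    """
    if max_activation <= 0:
        return [0 for _ in activations]
    return [
        min(scale, int(scale * max(a, 0.0) / max_activation))
        for a in activations
    ]


def format_token_activation_pairs(
    tokens: Sequence[str], activations: Sequence[float], max_activation: float
) -> str:
    """Render one text excerpt as tab-separated token/activation lines."""
    if len(tokens) != len(activations):
        raise ValueError("tokens and activations must have the same length")
    normalized = normalize_activations(activations, max_activation)
    lines = [f"{tok}\t{act}" for tok, act in zip(tokens, normalized)]
    # Assumed delimiters marking the start and end of one excerpt.
    return "\n".join(["<start>", *lines, "<end>"])


if __name__ == "__main__":
    tokens = ["Do", " you", " understand", "?"]
    activations = [0.0, 0.5, 2.0, -1.0]  # made-up values for illustration
    print(format_token_activation_pairs(tokens, activations, max_activation=2.0))
```

An explainer model would then be shown several such excerpts and asked to summarize what the neuron responds to; the exact prompt wording is not reproduced here.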
Recent Explanations
The neuron activates on pronouns, possessive adjectives, and conjunctions when referring to people, often in the context of personal involvement or identity.
The neuron primarily activates on frequently occurring words like "the" and "and" when they appear in technical or instructional contexts, often in close proximity to numbers or specialized terms.
gemini-2.5-flash
pattern may include the sub-steps of: comparing the
The neuron spotlights special control‐ or header‐tokens (like the `<|start_header_id|>`, `<|end_header_id|>`, and similar markers) that delimit and label parts of the chat transcript.
o4-mini
"Do you understand?"<|eot_id|><|start_header_id|>assistant<|end_header_id|>↵↵No!