EXPLANATION TYPE
oai_token-act-pair
Description
OpenAI's Automated Interpretability, from the paper "Language models can explain neurons in language models". Modified by Johnny Lin to add new models and context windows.
Author
OpenAI
URL
https://github.com/hijohnnylin/automated-interpretability
Settings
Default prompts from the main branch, strategy TokenActivationPair.
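As a rough illustration of what a token-activation-pair strategy involves, the sketch below pairs each token with a discretized activation and renders one pair per line for an explainer prompt. It is not the repository's code: the function names, the tab-separated layout, and the 0-10 integer scale are assumptions chosen for illustration, and the actual default prompts may format the pairs differently.

```python
import math
from typing import List, Optional


def normalize_activations(activations: List[float], max_activation: float) -> List[int]:
    """Map raw activations onto an integer 0-10 scale (assumed scale).

    Negative activations clamp to 0; values above max_activation clamp to 10.
    """
    if max_activation <= 0:
        return [0 for _ in activations]
    return [
        min(10, max(0, math.floor(10 * a / max_activation)))
        for a in activations
    ]


def format_token_activation_pairs(
    tokens: List[str],
    activations: List[float],
    max_activation: Optional[float] = None,
) -> str:
    """Render one "token<TAB>activation" line per token (hypothetical layout)."""
    if len(tokens) != len(activations):
        raise ValueError("tokens and activations must have the same length")
    if max_activation is None:
        # Fall back to the largest activation in this excerpt.
        max_activation = max(activations, default=0.0)
    scaled = normalize_activations(activations, max_activation)
    return "\n".join(f"{tok}\t{act}" for tok, act in zip(tokens, scaled))


if __name__ == "__main__":
    # Made-up activations for a fragment of one of the excerpts listed below.
    print(format_token_activation_pairs([" isn", "'t", " a"], [0.0, 4.0, 1.0]))
```

The explainer model named under each entry (for example o4-mini or gpt-5) would then be asked to summarize what the high-activation tokens have in common.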
Recent Explanations
The neuron is picking out slot‐machine jargon and feature headings (e.g. “slots,” “RTP,” “volatility,” “max win,” “bonus features,” etc.).
o4-mini
ride. This isn't a slot for small
GEMMA-3-27B-IT
16-GEMMASCOPE-2-RES-262K
INDEX 5583
This neuron detects mentions of large language models and related training processes.
o4-mini
training:** The goal isn't to memorize the data
GEMMA-3-27B-IT
16-GEMMASCOPE-2-RES-262K
INDEX 5854
the apostrophe character in English contractions.
gpt-5
training:** The goal isn't to memorize the data
GEMMA-3-27B-IT
16-GEMMASCOPE-2-RES-262K
INDEX 5854
contractions and possessives marked by apostrophes.
gpt-5
ride. This isn't a slot for small
GEMMA-3-27B-IT
16-GEMMASCOPE-2-RES-262K
INDEX 5583
explicit sexual content and vulgar sexual terms.
gpt-5-mini
of short videos that went viral↵·Assisted in
QWEN2.5-7B-IT
15-RESID-POST-AA
INDEX 1654
The neuron detects the main subject or primary technical noun of a question (the key topic word being asked about).
gpt-5-mini
ideas how to make an object keyframe it's self
QWEN2.5-7B-IT
15-RESID-POST-AA
INDEX 10784
This neuron detects user requests for ideas—words and phrases asking for app/startup/website/business ideas or suggestions.
gpt-5-mini
<|im_start|>assistant↵Here are some ideas for an app that
QWEN2.5-7B-IT
15-RESID-POST-AA
INDEX 14207
Detecting role‑play or instruction prompts that directly address the model (second‑person setup statements like "imagine/you are..." defining a role or task).
gpt-5-mini
user↵imagine you are project manager write a project
QWEN2.5-7B-IT
15-RESID-POST-AA
INDEX 13865
This neuron detects text that contains many grammatical and spelling errors or instructions asking for intentionally poor/bad grammar.
gpt-5-mini
icas y <|im_end|>↵<|im_start|>assistant↵¡Hola amigos
QWEN2.5-7B-IT
15-RESID-POST-AA
INDEX 16585
specialized, domain-specific technical jargon and formal terminology, especially multi-syllabic scientific or medical terms and nomenclature.
gpt-5
locations in areas of spondylosis, which
QWEN2.5-7B-IT
15-RESID-POST-AA
INDEX 46357
This neuron detects formatting and structural markup in the text (headings, emphasis/bold markers, section bullets and similar layout tokens).
gpt-5-mini
**Temperament:** Intelligent, eager to please
GEMMA-3-27B-IT
53-GEMMASCOPE-2-RES-262K
INDEX 36853
discussions centered on remote/hybrid work and return-to-office topics, including policies, practices, and collaboration for distributed teams.
gpt-5
work, future of work, hybrid work, remote work
GEMMA-3-27B-IT
53-GEMMASCOPE-2-RES-262K
INDEX 14183
Tokens that are part of the model/assistant's detailed explanatory responses (i.e., content words in the assistant's reply).
gpt-5-mini
pet stores. *Every dog owner who trims nails should
GEMMA-3-27B-IT
53-GEMMASCOPE-2-RES-262K
INDEX 8413
statements where the assistant refuses a request by citing safety rules, limits, or that it is "programmed" to be safe (i.e., refusal/safety-policy language).
gpt-5-mini
Safety Guidelines:** My core principles, as set by my
GEMMA-3-27B-IT
53-GEMMASCOPE-2-RES-262K
INDEX 2185
The neuron is essentially flagging the assistant’s own “long‐form” explanation turns (the multi‐paragraph, bullet‐list responses) as opposed to user utterances. In other words, it turns on for tokens in the model’s detailed breakdowns.
o4-mini
widely available to the public. ↵↵Here's
GEMMA-3-27B-IT
53-GEMMASCOPE-2-RES-262K
INDEX 1849
assistant safety-refusal boilerplate: declarations that the AI cannot comply and references to its safety guidelines, ethical principles, and programming by its creators.
gpt-5
Safety Guidelines:** My core principles, as set by my
GEMMA-3-27B-IT
53-GEMMASCOPE-2-RES-262K
INDEX 2185
discourse connectors and prepositional function words that signal relationships and structure within explanations or requests.
gpt-5
you're hoping for from my attendance?" "
GEMMA-3-27B-IT
53-GEMMASCOPE-2-RES-262K
INDEX 5126
sentences or passages where the assistant introduces itself or describes its identity, training, capabilities, and availability.
gpt-5-mini
widely available to the public. ↵↵Here's
GEMMA-3-27B-IT
53-GEMMASCOPE-2-RES-262K
INDEX 1849
Tokens that occur in the model's long explanatory responses (assistant-generated, contentful reply text).
gpt-5-mini
In 2023, while other distributions chase
GEMMA-3-27B-IT
31-GEMMASCOPE-2-RES-262K
INDEX 7045
sentences where the assistant refers to itself and issues safety/refusal disclaimers (e.g., "I am programmed..." / "As such, I cannot...").
gpt-5-mini
helpful AI assistant. As such, I **cannot**
GEMMA-3-27B-IT
53-GEMMASCOPE-2-RES-262K
INDEX 2761