EXPLANATION TYPE
oai_token-act-pair
Description
OpenAI's Automated Interpretability, from the paper "Language models can explain neurons in language models". Modified by Johnny Lin to add new models and context windows.
Author
OpenAI
URL
https://github.com/hijohnnylin/automated-interpretability
Settings
Default prompts from the main branch, strategy TokenActivationPair.
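As a rough illustration of what a token-activation-pair strategy feeds to the explainer model, the sketch below renders each token next to a discretized activation value. It is a minimal sketch, not the repository's code: the function names, the 0-10 scale, the clamping of negative activations to zero, and the tab-separated layout are all assumptions made for illustration.

```python
import math
from typing import List, Sequence

# Assumed ceiling of the discretized scale (hypothetical; chosen for illustration).
MAX_SCALE = 10


def normalize_activations(activations: Sequence[float], max_activation: float) -> List[int]:
    """Clamp negative activations to 0 and rescale to integers in [0, MAX_SCALE]."""
    if max_activation <= 0:
        # Nothing fires: every token gets the lowest bucket.
        return [0 for _ in activations]
    return [
        min(MAX_SCALE, math.floor(MAX_SCALE * max(a, 0.0) / max_activation))
        for a in activations
    ]


def format_token_activation_pairs(
    tokens: Sequence[str],
    activations: Sequence[float],
    max_activation: float,
) -> str:
    """Render one 'token<TAB>activation' line per token (hypothetical layout)."""
    if len(tokens) != len(activations):
        raise ValueError("tokens and activations must have the same length")
    scaled = normalize_activations(activations, max_activation)
    return "\n".join(f"{tok}\t{act}" for tok, act in zip(tokens, scaled))


if __name__ == "__main__":
    # Made-up activations for a snippet resembling the examples in this listing.
    tokens = ["As", " a", " large", " language", " model"]
    activations = [0.0, 2.5, 5.0, 4.0, -1.0]
    print(format_token_activation_pairs(tokens, activations, max_activation=5.0))
```

Passing `max_activation` explicitly, rather than taking the maximum of the snippet, keeps the scale comparable across several text excerpts from the same feature.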
Recent Explanations
This neuron detects formatting and structural markup in the text (headings, emphasis/bold markers, section bullets and similar layout tokens).
gpt-5-mini
**Temperament:** Intelligent, eager to please
GEMMA-3-27B-IT
53-GEMMASCOPE-2-RES-262K
INDEX 36853
discussions centered on remote/hybrid work and return-to-office topics, including policies, practices, and collaboration for distributed teams.
gpt-5
work, future of work, hybrid work, remote work
GEMMA-3-27B-IT
53-GEMMASCOPE-2-RES-262K
INDEX 14183
Tokens that are part of the model/assistant's detailed explanatory responses (i.e., content words in the assistant's reply).
gpt-5-mini
pet stores. *Every dog owner who trims nails should
GEMMA-3-27B-IT
53-GEMMASCOPE-2-RES-262K
INDEX 8413
statements where the assistant refuses a request by citing safety rules, limits, or that it is "programmed" to be safe (i.e., refusal/safety-policy language).
gpt-5-mini
Safety Guidelines:** My core principles, as set by my
GEMMA-3-27B-IT
53-GEMMASCOPE-2-RES-262K
INDEX 2185
The neuron is essentially flagging the assistant’s own “long‐form” explanation turns (the multi‐paragraph, bullet‐list responses) as opposed to user utterances. In other words, it turns on for tokens in the model’s detailed breakdowns.
o4-mini
widely available to the public. ↵↵Here's
GEMMA-3-27B-IT
53-GEMMASCOPE-2-RES-262K
INDEX 1849
assistant safety-refusal boilerplate: declarations that the AI cannot comply and references to its safety guidelines, ethical principles, and programming by its creators.
gpt-5
Safety Guidelines:** My core principles, as set by my
GEMMA-3-27B-IT
53-GEMMASCOPE-2-RES-262K
INDEX 2185
discourse connectors and prepositional function words that signal relationships and structure within explanations or requests.
gpt-5
you're hoping for from my attendance?" "
GEMMA-3-27B-IT
53-GEMMASCOPE-2-RES-262K
INDEX 5126
sentences or passages where the assistant introduces itself or describes its identity, training, capabilities, and availability.
gpt-5-mini
widely available to the public. ↵↵Here's
GEMMA-3-27B-IT
53-GEMMASCOPE-2-RES-262K
INDEX 1849
Tokens that occur in the model's long explanatory responses (assistant-generated, contentful reply text).
gpt-5-mini
In 2023, while other distributions chase
GEMMA-3-27B-IT
31-GEMMASCOPE-2-RES-262K
INDEX 7045
sentences where the assistant refers to itself and issues safety/refusal disclaimers (e.g., "I am programmed..." / "As such, I cannot...").
gpt-5-mini
helpful AI assistant. As such, I **cannot**
GEMMA-3-27B-IT
53-GEMMASCOPE-2-RES-262K
INDEX 2761
This neuron detects the model’s self-description phrase “As a large language model” (and similar self-referential disclaimers).
o4-mini
Hi! As a large language model, I don'
GEMMA-3-27B-IT
40-GEMMASCOPE-2-RES-262K
INDEX 165649
The neuron strongly activates on the pattern where the model refers to itself as “a large language model,” i.e. self-identification phrases stating “As a large language model…”
o4-mini
↵As a large language model created by the Gemma team
GEMMA-3-27B-IT
40-GEMMASCOPE-2-RES-262K
INDEX 119809
the start of an assistant/model reply (the token marking the beginning of the model's response).
gpt-5-mini
?<end_of_turn>↵<start_of_turn>model↵Hi there! I'
GEMMA-3-27B-IT
26-GEMMASCOPE-2-TRANSCODER-262K
INDEX 196575
The neuron is identifying when the model is discussing its own nature, limitations, or role as an AI language model.
claude-4-5-haiku
Role:** As an AI, I am programmed to be
GEMMA-3-27B-IT
40-GEMMASCOPE-2-RES-262K
INDEX 14865
mentions of artificial intelligence, especially references to AI assistants, models, or AI-powered technologies.
gpt-5
Ask.ai is an AI-powered search engine specifically
GEMMA-3-27B-IT
40-GEMMASCOPE-2-RES-262K
INDEX 7008
The neuron strongly activates on the acronym “AI.”
o4-mini
Ask.ai is an AI-powered search engine specifically
GEMMA-3-27B-IT
40-GEMMASCOPE-2-RES-262K
INDEX 7008
This neuron spots the assistant’s self-descriptive policy-and-safety statements, especially “I cannot/absolutely cannot” refusals based on its programming constraints.
o4-mini
Safety:** My purpose is to provide a safe and
GEMMA-3-27B-IT
31-GEMMASCOPE-2-RES-262K
INDEX 8163
The neuron fires on self-referential AI identity phrases (e.g. “As a large language model,” “AI,” “model,” etc.).
o4-mini
As a large language model, I am **not**
GEMMA-3-27B-IT
31-GEMMASCOPE-2-RES-262K
INDEX 100587
statements where the speaker issues a disclaimer identifying themselves as an AI and noting limitations or non-advisory status.
gpt-5
Disclaimer:** I am an AI and cannot provide financial advice
GEMMA-3-27B-IT
31-GEMMASCOPE-2-RES-262K
INDEX 4874
The neuron activates on the self‐referential “As a large language model” style disclaimer phrase.
o4-mini
↵<start_of_turn>model↵As a large language model, I
GEMMA-3-27B-IT
31-GEMMASCOPE-2-RES-262K
INDEX 26409