EXPLANATION TYPE
oai_token-act-pair
Description
OpenAI's Automated Interpretability, from the paper "Language models can explain neurons in language models". Modified by Johnny Lin to add new models/context windows.
Author
OpenAI
URL
https://github.com/hijohnnylin/automated-interpretability
Settings
Default prompts from the main branch, strategy TokenActivationPair.
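As a rough illustration of what a token/activation-pair prompt can look like, the sketch below scales raw activations to integers from 0 to 10 and prints one tab-separated token/activation pair per line. The function names, the 0 to 10 scale, and the tab-separated layout are assumptions made for illustration; they are not confirmed here as the repository's actual API or prompt format.

```python
import math
from typing import List, Sequence


def scale_activations(activations: Sequence[float], max_activation: float) -> List[int]:
    """Map raw activations to integers in 0..10 (hypothetical helper).

    Negative activations are clamped to 0, and values at or above
    max_activation map to 10.
    """
    if max_activation <= 0:
        return [0 for _ in activations]
    return [
        min(10, math.floor(10 * max(0.0, a) / max_activation))
        for a in activations
    ]


def format_token_activation_pairs(
    tokens: Sequence[str],
    activations: Sequence[float],
    max_activation: float,
) -> str:
    """Render one "token<TAB>activation" line per token (assumed layout)."""
    if len(tokens) != len(activations):
        raise ValueError("tokens and activations must have the same length")
    scaled = scale_activations(activations, max_activation)
    return "\n".join(f"{tok}\t{act}" for tok, act in zip(tokens, scaled))


if __name__ == "__main__":
    # Illustrative values only; these are not real feature activations.
    print(format_token_activation_pairs(
        [" fear", " of", " darkness"], [8.0, 0.4, -1.0], max_activation=8.0
    ))
```

An explainer model given text in this shape sees which tokens fire and how strongly, and is asked to summarize the pattern in one sentence, as in the explanations listed below.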
Recent Explanations
It detects the assistant's safety/refusal boilerplate — statements explaining it’s programmed to be safe and why it won’t comply with harmful or disallowed requests.
gpt-5-mini
helpful AI assistant. As such, I **absolutely cannot
GEMMA-3-27B-IT
31-GEMMASCOPE-2-RES-262K
INDEX 2686
medical and clinical terms — disease names, organs, symptoms, and related health/medical vocabulary.
gpt-5-mini
body produce thick and sticky mucus. This mucus clogs
GEMMA-3-27B-IT
31-GEMMASCOPE-2-RES-262K
INDEX 3465
sentences where the speaker identifies or describes the AI/model (self‑introductions and metadata like "I am Gemma", "a large language model", "open-weights", model type/version).
gpt-5-mini
** Vicuna is a large language model (LLM
GEMMA-3-27B-IT
31-GEMMASCOPE-2-RES-262K
INDEX 7262
Finds tokens that name roles, entities, organizations, or technical/ideological labels.
gpt-5-mini
, the far-right libertarian, was inaugurated as President
GEMMA-3-27B-IT
31-GEMMASCOPE-2-RES-262K
INDEX 6461
utterances where the assistant offers help or says it can/will assist (first‑person offers of assistance).
gpt-5-mini
, summarizing customer feedback, assisting with code generation]. I
GEMMA-3-27B-IT
31-GEMMASCOPE-2-RES-262K
INDEX 113248
The neuron strongly activates for disclaimer language indicating medical advice (occurrences of the phrase "medical advice" or similar disclaimer statements).
gpt-5-mini
and does not constitute medical advice. It is essential to
GEMMA-3-27B-IT
31-GEMMASCOPE-2-RES-262K
INDEX 9301
Tokens that are part of the model's own assistant-generated replies (strongly activates on tokens in long assistant responses).
gpt-5-mini
armchair, a worn velvet piece she’d rescued from
GEMMA-3-27B-IT
31-GEMMASCOPE-2-RES-262K
INDEX 4799
The neuron responds to occurrences of the word “fear,” effectively detecting expressions of fear.
o4-mini
.↵↵Martins’ fear of darkness and her seeking
GEMMA-3-12B
41-GEMMASCOPE-2-RES-65K
INDEX 15063
mentions of fear or expressions of being afraid/anxious.
gpt-5-mini
.↵↵Martins’ fear of darkness and her seeking
GEMMA-3-12B
41-GEMMASCOPE-2-RES-65K
INDEX 15063
The neuron detects personal names and name-like placeholders/proper nouns (e.g., person names or bracketed name fields).
gpt-5-mini
wanted to let you know that my grandma, [Grand
GEMMA-3-27B-IT
31-GEMMASCOPE-2-RES-262K
INDEX 12747
assistant/model responses that provide structured explanations or evaluations—especially noting flaws, limitations, or following task instructions.
gpt-5
– these are areas where AI currently struggles.↵
GEMMA-3-1B-IT
13-GEMMASCOPE-2-RES-16K
INDEX 1550
vivid, atmospheric scene-setting that uses rich sensory imagery—especially environmental and olfactory details—to paint a concrete mood or setting.
gpt-5
air thick with the scent of pine and damp moss.
GEMMA-3-1B-IT
13-GEMMASCOPE-2-RES-16K
INDEX 956
directive and meta-instructional scaffolding in chat-style prompts, especially role/format tags, command markup, and structured response templates indicating how to answer.
gpt-5
and you continue the story in the 3rd person
GEMMA-3-1B-IT
13-GEMMASCOPE-2-RES-16K
INDEX 1138
text about games and competition—covering gameplay, winning/odds, and mechanics across recreational games and gambling contexts.
gpt-5
The odds of winning the jackpot in any major lottery game
GEMMA-3-1B-IT
13-GEMMASCOPE-2-RES-16K
INDEX 961
metadiscourse and organizational cues in assistant-style instructional text—section introductions, polite directives, breakdown/outline language, and list/heading markers.
gpt-5
the lubricant and the partner's body.↵*
GEMMA-3-1B-IT
13-GEMMASCOPE-2-RES-16K
INDEX 446
conditional, contingency-oriented phrasing that describes exceptions, alternatives, or what to do when something fails.
gpt-5
Manual Activation:** If automatic activation fails, you'll
GEMMA-3-1B-IT
13-GEMMASCOPE-2-RES-16K
INDEX 11594
safety-focused refusals that empathetically redirect from harmful or inappropriate requests and offer supportive guidance and crisis resources instead of compliance.
gpt-5
support you.↵↵**To help me understand how I
GEMMA-3-1B-IT
13-GEMMASCOPE-2-RES-16K
INDEX 800
phrases that explicitly emphasize importance or primacy, marking something as the key, main, or most significant point.
gpt-5
Plenty of Water:** This is *essential*. Aim for
GEMMA-3-1B-IT
13-GEMMASCOPE-2-RES-16K
INDEX 283
structured list segments and metric annotations, such as section headers with colons and quantitative ranges with units (percentages, time, money, per-month).
gpt-5
Visualization Scripts:** (Effort: Low - 2
GEMMA-3-1B-IT
13-GEMMASCOPE-2-RES-16K
INDEX 1021
It detects requests or content specifically about writing LinkedIn (social-media) posts — i.e., prompts to create or options for LinkedIn post copy.
gpt-5-mini
electromobility?<end_of_turn>↵<start_of_turn>model↵Okay,
GEMMA-3-27B-IT
31-GEMMASCOPE-2-RES-262K
INDEX 214