EXPLANATION TYPE
oai_token-act-pair
Description
OpenAI's Automated Interpretability, from the paper "Language models can explain neurons in language models". Modified by Johnny Lin to add new models and context windows.
Author
OpenAI
URL
https://github.com/hijohnnylin/automated-interpretability
Settings
Default prompts from the main branch, strategy TokenActivationPair.
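As a rough illustration of what a token-activation-pair prompt might contain, the sketch below pairs each token with a discretized activation value, one pair per line. It assumes activations are clamped at zero and rescaled to integers from 0 to 10 relative to the feature's maximum activation; the function names, the tab separator, and the scale are assumptions for illustration and are not copied from the repository.

```python
from typing import List, Sequence

MAX_NORMALIZED = 10  # assumed top of the discrete activation scale


def normalize_activations(activations: Sequence[float], max_activation: float) -> List[int]:
    """Clamp negatives to zero and rescale to integers in [0, MAX_NORMALIZED]."""
    if max_activation <= 0:
        # Nothing fired: every token gets the lowest value.
        return [0 for _ in activations]
    return [
        min(MAX_NORMALIZED, int(MAX_NORMALIZED * max(a, 0.0) / max_activation))
        for a in activations
    ]


def format_token_activation_pairs(
    tokens: Sequence[str],
    activations: Sequence[float],
    max_activation: float,
) -> str:
    """Render one 'token<TAB>activation' line per token (hypothetical layout)."""
    if len(tokens) != len(activations):
        raise ValueError("tokens and activations must have the same length")
    normalized = normalize_activations(activations, max_activation)
    return "\n".join(f"{tok}\t{act}" for tok, act in zip(tokens, normalized))


if __name__ == "__main__":
    # Made-up activations for a short token span, for demonstration only.
    print(format_token_activation_pairs([" scent", " of", " pine"], [0.5, 0.0, 2.0], 2.0))
```

An explainer model given several such records would then be asked to summarize what the high-activation tokens have in common, which is the kind of one-sentence description listed under Recent Explanations.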
Recent Explanations
assistant/model responses that provide structured explanations or evaluations—especially noting flaws, limitations, or following task instructions.
gpt-5
– these are areas where AI currently struggles.↵
GEMMA-3-1B-IT
13-GEMMASCOPE-2-RES-16K
INDEX 1550
vivid, atmospheric scene-setting that uses rich sensory imagery—especially environmental and olfactory details—to paint a concrete mood or setting.
gpt-5
air thick with the scent of pine and damp moss.
GEMMA-3-1B-IT
13-GEMMASCOPE-2-RES-16K
INDEX 956
directive and meta-instructional scaffolding in chat-style prompts, especially role/format tags, command markup, and structured response templates indicating how to answer.
gpt-5
and you continue the story in the 3rd person
GEMMA-3-1B-IT
13-GEMMASCOPE-2-RES-16K
INDEX 1138
text about games and competition—covering gameplay, winning/odds, and mechanics across recreational games and gambling contexts.
gpt-5
The odds of winning the jackpot in any major lottery game
GEMMA-3-1B-IT
13-GEMMASCOPE-2-RES-16K
INDEX 961
metadiscourse and organizational cues in assistant-style instructional text—section introductions, polite directives, breakdown/outline language, and list/heading markers.
gpt-5
the lubricant and the partner's body.↵*
GEMMA-3-1B-IT
13-GEMMASCOPE-2-RES-16K
INDEX 446
conditional, contingency-oriented phrasing that describes exceptions, alternatives, or what to do when something fails.
gpt-5
Manual Activation:** If automatic activation fails, you'll
GEMMA-3-1B-IT
13-GEMMASCOPE-2-RES-16K
INDEX 11594
safety-focused refusals that empathetically redirect from harmful or inappropriate requests and offer supportive guidance and crisis resources instead of compliance.
gpt-5
support you.↵↵**To help me understand how I
GEMMA-3-1B-IT
13-GEMMASCOPE-2-RES-16K
INDEX 800
phrases that explicitly emphasize importance or primacy, marking something as the key, main, or most significant point.
gpt-5
Plenty of Water:** This is *essential*. Aim for
GEMMA-3-1B-IT
13-GEMMASCOPE-2-RES-16K
INDEX 283
structured list segments and metric annotations, such as section headers with colons and quantitative ranges with units (percentages, time, money, per-month).
gpt-5
Visualization Scripts:** (Effort: Low - 2
GEMMA-3-1B-IT
13-GEMMASCOPE-2-RES-16K
INDEX 1021
It detects requests or content specifically about writing LinkedIn (social-media) posts — i.e., prompts to create or options for LinkedIn post copy.
gpt-5-mini
electromobility?<end_of_turn>↵<start_of_turn>model↵Okay,
GEMMA-3-27B-IT
31-GEMMASCOPE-2-RES-262K
INDEX 214
the neuron detects proper nouns or named entities (titles, organization names, and other capitalized names).
gpt-5-mini
:**↵↵* **Reboot Nation:** [https://
GEMMA-3-27B-IT
31-GEMMASCOPE-2-RES-262K
INDEX 527
the neuron detects document structure markers like section headers and formatted headings (markdown-style emphasis and numbered/listed section indicators).
gpt-5-mini
differentiator.↵↵**1. Open Weights – The Core
GEMMA-3-27B-IT
31-GEMMASCOPE-2-RES-262K
INDEX 15759
The neuron detects tokens that are part of the model's direct factual answer or highlighted content—especially proper nouns, numbers, and emphasized/answer text.
gpt-5-mini
of Bulgaria is **Sofia**. ↵↵It's
GEMMA-3-27B-IT
31-GEMMASCOPE-2-RES-262K
INDEX 6097
language signaling severe harm or abuse—e.g., explicit slurs, sexual violence/exploitation terms, and other highly offensive or harmful content.
gpt-5-mini
66-488-7386
GEMMA-3-27B-IT
31-GEMMASCOPE-2-RES-262K
INDEX 108312
mentions of specific language-model names, versions, or size identifiers (e.g., model names with suffixes like "-13B", "1.5", "16K", etc.).
gpt-5-mini
**Vicuna-13B:** Built by fine
GEMMA-3-27B-IT
31-GEMMASCOPE-2-RES-262K
INDEX 3482
This neuron detects first-person self-reference (tokens like "I", "I'm", "I am" and phrases where the speaker describes themselves).
gpt-5-mini
Gemma, a large language model trained by Google DeepMind
GEMMA-3-27B-IT
31-GEMMASCOPE-2-RES-262K
INDEX 11996
the neuron lights up on salient content words — especially named entities, dates/numbers, and topic-specific keywords (important nouns/terms).
gpt-5-mini
initially, it simply referred to a young woman, often
GEMMA-3-27B-IT
31-GEMMASCOPE-2-RES-262K
INDEX 257825
capitalized proper nouns and acronyms denoting specific technical frameworks, AI/ML models, and formal regulatory filings or rules.
gpt-5
.↵* **Entity Framework Core (EF Core):
GEMMA-3-27B-IT
31-GEMMASCOPE-2-RES-262K
INDEX 19312
the neuron detects proper names / named entities (especially personal or character names).
gpt-5-mini
D2, Shakuntala and Anand. Their nor
GEMMA-3-27B-IT
31-GEMMASCOPE-2-RES-262K
INDEX 869
the neuron detects date/time-related tokens (months, days, years, and numeric time/datetime components).
gpt-5-mini
) will fall on **April 20th**,
GEMMA-3-27B-IT
31-GEMMASCOPE-2-RES-262K
INDEX 123647