EXPLANATION TYPE
oai_token-act-pair
Description
OpenAI's Automated Interpretability, from the paper "Language models can explain neurons in language models". Modified by Johnny Lin to add new models and context windows.
Author
OpenAI
URL
https://github.com/hijohnnylin/automated-interpretability
Settings
Default prompts from the main branch, strategy TokenActivationPair.
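As a rough illustration of what a token-activation-pair strategy involves, the sketch below pairs each token with a discretized activation and renders one tab-separated pair per line for an explainer prompt. This is not the repository's code: the function names, the 0-10 scale, the truncating division, and the tab-separated layout are all assumptions made for the example.

```python
from typing import List, Sequence


def normalize_activations(
    activations: Sequence[float], max_activation: float, scale: int = 10
) -> List[int]:
    """Map raw activations to integers in [0, scale] (hypothetical helper).

    Negative activations clamp to 0; values above max_activation clamp to scale.
    """
    if max_activation <= 0:
        return [0 for _ in activations]
    return [
        min(scale, max(0, int(scale * a / max_activation))) for a in activations
    ]


def format_token_activation_pairs(
    tokens: Sequence[str], activations: Sequence[float], max_activation: float
) -> str:
    """Render one 'token<TAB>level' line per token (assumed layout)."""
    if len(tokens) != len(activations):
        raise ValueError("tokens and activations must have the same length")
    levels = normalize_activations(activations, max_activation)
    return "\n".join(f"{tok}\t{lvl}" for tok, lvl in zip(tokens, levels))


# Made-up activations for a fragment similar to the snippets listed on this page.
example = format_token_activation_pairs(
    ["positive", " feedback", " loop"], [0.0, 4.0, 2.0], max_activation=4.0
)
print(example)
```

Discretizing to a small integer range keeps the prompt compact and gives the explainer model a coarse signal of which tokens matter most, at the cost of losing fine-grained differences between activations.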
Recent Explanations
Mentions of feedback—especially feedback loops or iterative feedback mechanisms.
gpt-5-mini
izing algorithms or by incorporating feedback from external sources.↵2
LLAMA3.1-8B-IT
19-RESID-POST-AA
INDEX 28851
This neuron detects document-level headings, titles and other structural/metadata elements (table-of-contents and section headings).
gpt-5-mini
ENTS↵↵↵THE BOOMING OF ACRE HILL
LLAMA3.1-8B-IT
19-RESID-POST-AA
INDEX 44904
It detects technical chemical/chemical‑industry terminology—especially long chemical compound names and words about synthesis/production.
gpt-5-mini
oxepin-9-one involves the synthesis of the
LLAMA3.1-8B-IT
19-RESID-POST-AA
INDEX 52414
This neuron detects mentions of the concept "focus" — appearing as that token (including in brand/product names) or in phrases about focusing, feedback, or focus groups.
gpt-5-mini
our March 4th Focus Flight!↵.↵Seats are
LLAMA3.1-8B-IT
19-RESID-POST-AA
INDEX 104362
This neuron detects speaker/role labels and dialogue-turn markers (tokens identifying who is speaking or header IDs).
gpt-5-mini
deal with.↵↵Person B: I'm sorry to hear
LLAMA3.1-8B-IT
19-RESID-POST-AA
INDEX 105095
document-structure markers and formal section headings (e.g., Title, Abstract, Introduction, Section labels).
gpt-5-mini
recent citations, and how it impacts the estimated costs for
LLAMA3.1-8B-IT
19-RESID-POST-AA
INDEX 127785
The neuron detects self-referential/reflexive language—tokens and phrases that refer back to the subject itself (e.g., "self", "itself", "talking to itself", "self-...").
gpt-5-mini
Government; literally the system talking to itself!↵↵While our
LLAMA3.1-8B-IT
19-RESID-POST-AA
INDEX 110626
References to rises in intracellular calcium ([Ca2+]i) and related channel activity/activation (often tied to proteolytic activity or positive-feedback signaling) in biomedical text.
gpt-5-mini
proteolytic activity in a positive feedback loop, leading
LLAMA3.1-8B-IT
19-RESID-POST-AA
INDEX 77783
Detects self-referential or meta discussion about storytelling/theatre (mentions of stories, plays, writing about writing, and meta-theatrical commentary).
gpt-5-mini
kind of meta-theater that questions the nature of art
LLAMA3.1-8B-IT
19-RESID-POST-AA
INDEX 51376
It detects content-bearing verbs and descriptive adjectives—action or property words used in narratives.
gpt-5-mini
floating by and decided to embrace it.↵As she watched
LLAMA3.1-8B-IT
19-RESID-POST-AA
INDEX 27476
the neuron's main role is to detect words about professional qualifications, experience, goals, motivations, and other job/role-related attributes.
gpt-5-mini
applicant about their goals and motivations. This will help you
LLAMA3.1-8B-IT
19-RESID-POST-AA
INDEX 57191
comma-separated lists, item enumeration, and sequential parallel structures.
claude-4-5-haiku
bank’s involvement in 1MDB, described as “
QWEN3-4B
11-TRANSCODER-HP
INDEX 139180
words related to correctness, accuracy, or doing something right or wrong.
claude-4-5-sonnet
indicating how many problems are correctly answered and how many are
QWEN3-4B
10-TRANSCODER-HP
INDEX 121413
statements evaluating correctness or accuracy of responses or decisions, including references to errors.
gpt-5
indicating how many problems are correctly answered and how many are
QWEN3-4B
10-TRANSCODER-HP
INDEX 121413
statements about correctness and accuracy, including references to errors and evaluation of answer or task performance.
gpt-5
indicating how many problems are correctly answered and how many are
QWEN3-4B
11-TRANSCODER-HP
INDEX 82089
references to performance evaluation—statements about correctness, incorrectness, error counts, and measured accuracy or success in tasks or tests.
gpt-5
the problems which are answered incorrectly and/or practice more.
QWEN3-4B
9-TRANSCODER-HP
INDEX 89857
words and phrases indicating defeat, losing, or trailing behind in competitive contexts.
claude-4-5-sonnet
Last year's quarter-final loss to comp
QWEN3-4B
9-TRANSCODER-HP
INDEX 151143
descriptions of unfavorable status or outcomes—being behind, losing, or otherwise negative—often in competitive or evaluative contexts.
gpt-5
Last year's quarter-final loss to comp
QWEN3-4B
9-TRANSCODER-HP
INDEX 151143
words with negative connotations indicating problems, defects, wrongness, or undesirable qualities.
claude-4-5-sonnet
and allows the use of invalid certificates. When using the
QWEN3-4B
11-TRANSCODER-HP
INDEX 128789
language signaling problems, faults, or negative conditions, often marked by negation or deficiency (e.g., invalidity, lack, inability, mis- states, failures, issues, or harms)
gpt-5
and allows the use of invalid certificates. When using the
QWEN3-4B
11-TRANSCODER-HP
INDEX 128789