EXPLANATION TYPE
oai_token-act-pair
Description
OpenAI's Automated Interpretability, from the paper "Language models can explain neurons in language models". Modified by Johnny Lin to add new models and context windows.
Author
OpenAI
URL
https://github.com/hijohnnylin/automated-interpretability
Settings
Default prompts from the main branch, using the TokenActivationPair strategy.
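As a rough illustration of what a token-activation-pair strategy involves, the sketch below turns a list of tokens and raw activations into the kind of text an explainer model could be shown. It assumes, following the approach described in the paper, that activations are rescaled to integers from 0 to 10 and listed one token per line with a tab before the value. The function names and the exact layout are hypothetical and are not taken from the repository.

```python
import math
from typing import List


def normalize_activations(activations: List[float], max_activation: float) -> List[int]:
    """Rescale raw activations to integers in [0, 10].

    Assumption: negative values are treated as 0 (inactive), and values are
    floored after scaling by the maximum activation.
    """
    if max_activation <= 0:
        return [0 for _ in activations]
    return [
        min(10, math.floor(10 * max(0.0, a) / max_activation))
        for a in activations
    ]


def format_token_activation_pairs(
    tokens: List[str], activations: List[float], max_activation: float
) -> str:
    """Render one 'token<TAB>activation' line per token (hypothetical layout)."""
    normalized = normalize_activations(activations, max_activation)
    return "\n".join(f"{tok}\t{act}" for tok, act in zip(tokens, normalized))


if __name__ == "__main__":
    # Illustrative values only; not real feature activations.
    print(format_token_activation_pairs(["run", " external", " commands"], [0.0, 2.5, 5.0], 5.0))
```

Each line pairs a token with its discretized activation, so the explainer model sees which tokens the feature fires on and how strongly.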
Recent Explanations
words describing prohibited content types and policy violations on online platforms.
claude-4-5-haiku
threats, hate speech, advocating violence and other violations can
GEMMA-2-27B
22-GEMMASCOPE-RES-131K
INDEX 11854
the character sequence "thro" inside tokens (a common subword in medical/biological terms).
gpt-5-mini
Deceased, and Iola Saunders, Administratrix cum
GEMMA-2-2B
1-CLT-HP
INDEX 1
Mentions of running external processes or using subprocess/shell commands to execute and capture program input/output.
gpt-5-mini
to run external commands and capture their input/output streams
GEMMA-3-27B-IT
16-GEMMASCOPE-2-RES-262K
INDEX 13935
the neuron responds to technical or scientific content—terms, measurements, and data-heavy/highly specific words found in experimental or domain-specific descriptions.
gpt-5-mini
under its native promoter. RNAseq data were generated from
GEMMA-3-27B-IT
16-GEMMASCOPE-2-RES-262K
INDEX 20363
text written in a robotic/AI persona with formal, protocol-driven technical phrasing, structured acknowledgments, and system-style markers (often including numeric designations).
gpt-5
4. Mimicry protocol initiated. Acknowledged
GEMMA-3-27B-IT
40-GEMMASCOPE-2-RES-262K
INDEX 12303
tokens that occur at the start of a sentence or turn (beginning-of-sentence/turn tokens).
gpt-5-mini
<bos><start_of_turn>user↵Coq10/l-
GEMMA-3-27B-IT
53-GEMMASCOPE-2-RES-262K
INDEX 239477
The neuron fires on emphasized or strongly intensifying tokens (words marked or used to add emphasis).
gpt-5-mini
Absolutely essential. Hermitage Museum (world-
GEMMA-3-27B-IT
53-GEMMASCOPE-2-RES-262K
INDEX 166260
sentences that are section headings, numbered list items, or other structural/formatting markers (e.g., list numbers and section labels).
gpt-5-mini
(2, 3)↵* **Order:**
GEMMA-3-27B-IT
53-GEMMASCOPE-2-RES-262K
INDEX 82843
capitalized proper nouns and titles—names of people, places, products/technologies, and media works.
gpt-5
what I know about Pega:↵↵**1.
GEMMA-3-27B-IT
31-GEMMASCOPE-2-RES-262K
INDEX 5118
mentions of the model name "Gemma" or tokens referring to the assistant's identity.
gpt-5-mini
google.dev/gemma](https://ai.
GEMMA-3-27B-IT
16-GEMMASCOPE-2-RES-262K
INDEX 7577
tokens that are named entities or proper nouns (product/model names, people, places, and other capitalized terms).
gpt-5-mini
" created by Rezzza. It quickly gained
GEMMA-3-27B-IT
16-GEMMASCOPE-2-RES-262K
INDEX 5042
mentions of the model's name/brand (the token identifying the model).
gpt-5-mini
language model created by the Gemma team at Google DeepMind
GEMMA-3-27B-IT
16-GEMMASCOPE-2-RES-262K
INDEX 11095
This neuron detects mentions of the model metadata token (e.g. “model”) and associated numeric or timestamp values.
o4-mini
, 2023, 1:2
GEMMA-3-27B-IT
16-GEMMASCOPE-2-RES-262K
INDEX 4816
explicit references to dates and times—calendar months, years, timestamps, and other recency/real-time context markers within responses.
gpt-5
, 2023, 1:2
GEMMA-3-27B-IT
16-GEMMASCOPE-2-RES-262K
INDEX 4816
discourse markers that signal explanation, contrast, or hypotheticals, along with first‑person metacommentary in analytical or expository text.
gpt-5
Without this, the script would pause and ask the
GEMMA-3-27B-IT
40-GEMMASCOPE-2-RES-262K
INDEX 1297
self-referential passages where an AI model describes its identity, training/architecture, capabilities, and limitations.
gpt-5
same way a human does, my responses improve as I
GEMMA-3-27B-IT
40-GEMMASCOPE-2-RES-262K
INDEX 3191
This neuron responds to asterisks used for emphasis or list/markdown formatting.
o4-mini
same way a human does, my responses improve as I
GEMMA-3-27B-IT
40-GEMMASCOPE-2-RES-262K
INDEX 3191
The neuron is detecting tokens that represent numeric quantities—especially list‐size numbers or other multi‐digit numerals.
o4-mini
Pole) (Difficulty: 1/5)**↵
GEMMA-3-27B-IT
31-GEMMASCOPE-2-RES-262K
INDEX 393
This neuron spikes on individual tokens that are part of named entities or proper names (e.g. titles, character or product names, specialized jargon), effectively detecting proper nouns.
o4-mini
8): Pokémon Legends: Arceus (Switch)**↵↵
GEMMA-3-27B-IT
31-GEMMASCOPE-2-RES-262K
INDEX 6195
The neuron is picking out slot‐machine jargon and feature headings (e.g. “slots,” “RTP,” “volatility,” “max win,” “bonus features,” etc.).
o4-mini
ride. This isn't a slot for small
GEMMA-3-27B-IT
16-GEMMASCOPE-2-RES-262K
INDEX 5583