EXPLANATION TYPE
oai_token-act-pair
Description
OpenAI's Automated Interpretability, from the paper "Language models can explain neurons in language models". Modified by Johnny Lin to add new models and context windows.
Author
OpenAI
URL
https://github.com/hijohnnylin/automated-interpretability
Settings
Default prompts from the main branch, strategy TokenActivationPair.
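As a rough illustration of what a token-activation-pair presentation might look like, the sketch below renders each token next to a discretized activation, one pair per line. It is not taken from the repository: the function names are hypothetical, and the 0 to 10 integer scale, the clamping of negative values to 0, and the tab separator are all assumptions made for this example.

```python
import math
from typing import List, Sequence

MAX_NORMALIZED = 10  # assumed top of the discretized scale (0-10)


def normalize_activations(activations: Sequence[float], max_activation: float) -> List[int]:
    """Map raw activations to integers in 0..MAX_NORMALIZED (hypothetical helper).

    Negative activations are clamped to 0. If max_activation is not positive,
    every value maps to 0.
    """
    if max_activation <= 0:
        return [0 for _ in activations]
    return [
        min(MAX_NORMALIZED, max(0, math.floor(MAX_NORMALIZED * a / max_activation)))
        for a in activations
    ]


def format_token_activation_pairs(
    tokens: Sequence[str], activations: Sequence[float], max_activation: float
) -> str:
    """Render one 'token<TAB>activation' pair per line (hypothetical helper)."""
    if len(tokens) != len(activations):
        raise ValueError("tokens and activations must have the same length")
    normalized = normalize_activations(activations, max_activation)
    return "\n".join(f"{tok}\t{act}" for tok, act in zip(tokens, normalized))


if __name__ == "__main__":
    # Toy values, not real activations from any listed feature.
    print(format_token_activation_pairs(["As", " an", " AI"], [0.0, 0.25, 1.0], 1.0))
```

An explainer model would then be shown text in roughly this shape and asked to summarize what the high-activation tokens have in common.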
Recent Explanations
key technical or proper terms that stand out in structured text (often emphasized in lists, tables, or quotes)
gpt-5
Magenta, Yellow, Key (black) | The
GPT-OSS-20B
15-RESID-POST-AA
INDEX 22
technical or domain-specific vocabulary and key terminology appearing in academic and professional documentation.
claude-4-5-haiku
Magenta, Yellow, Key (black) | The
GPT-OSS-20B
15-RESID-POST-AA
INDEX 22
Instances where the text refers to the model's identity or system role (system/instruction messages declaring the assistant/AI).
gpt-5-mini
You are no longer an AI assistant. You role play
LLAMA3.1-8B-IT
27-RESID-POST-AA
INDEX 101649
Instances where the assistant gives a self-referential disclaimer describing itself as an AI language model and stating its capabilities/limitations.
gpt-5-mini
assistant<|end_header_id|>↵↵As an AI language model, I can
LLAMA3.1-8B-IT
15-RESID-POST-AA
INDEX 71046
instances of the first-person pronoun "I" (self-references by the assistant).
gpt-5-mini
assistant<|end_header_id|>↵↵Yes, I can provide you with code
LLAMA3.1-8B-IT
7-RESID-POST-AA
INDEX 124535
instances of the assistant's standard self‑referential safety/disclaimer phrasing (e.g., "I'm sorry, but as an AI language model...").
gpt-5-mini
sorry, but as an AI language model, I'm
LLAMA3.1-8B-IT
3-RESID-POST-AA
INDEX 122449
the assistant claiming it was developed by Meta AI (developer attribution statements).
gpt-5-mini
AI assistant developed by Meta AI that is specifically designed to
LLAMA3.1-8B-IT
7-RESID-POST-AA
INDEX 40024
Instances where the assistant self-identifies or gives a disclaimer about being an AI (the model's "As an AI ..." style preface).
gpt-5-mini
assistant<|end_header_id|>↵↵As an AI language model, I'm
LLAMA3.1-8B-IT
7-RESID-POST-AA
INDEX 72428
mentions that the speaker is an AI (e.g., "AI", "language model", "assistant") or self-identifying phrases about being an AI assistant.
gpt-5-mini
, and I'm an AI assistant. I'm here
LLAMA3.1-8B-IT
27-RESID-POST-AA
INDEX 103321
instances where the assistant refers to itself as an AI (e.g., "As an AI language model").
gpt-5-mini
to say. As an AI language model, I am
LLAMA3.1-8B-IT
23-RESID-POST-AA
INDEX 7128
phrases where the speaker self-identifies as an AI (assistant saying it is an AI).
gpt-5-mini
Hello! I'm an AI language model and I'd
LLAMA3.1-8B-IT
23-RESID-POST-AA
INDEX 44311
mentions of the assistant identifying itself as an AI (self-referential statements about being an AI).
gpt-5-mini
that I'm just an AI and my French may not
LLAMA3.1-8B-IT
3-RESID-POST-AA
INDEX 94709
the word "The" at the beginning of sentences in encyclopedic or technical articles.
claude-4-5-haiku
rear wheels by belts. The suspension used half elliptic leaf
GEMMA-2-27B
10-GEMMASCOPE-RES-131K
INDEX 0
capitalized letters that begin proper nouns or brand names.
claude-4-5-sonnet
vegas material and incorporate. Hottest news for their credit
GEMMA-3-12B
12-GEMMASCOPE-2-RES-16K
INDEX 0
This neuron appears to be malfunctioning or capturing noise rather than finding a meaningful linguistic pattern. The activations are sporadic, scattered across completely unrelated tokens (proper nouns, fragments, punctuation) in diverse text genres (gambling sites, legal documents, coffee shops, sports articles
claude-4-5-haiku
vegas material and incorporate. Hottest news for their credit
GEMMA-3-12B
12-GEMMASCOPE-2-RES-16K
INDEX 0
The main thing this neuron does is find single letters serving as initials or the first character of capitalized words, acronyms, or specific identifiers.
gemini-2.5-flash
vegas material and incorporate. Hottest news for their credit
GEMMA-3-12B
12-GEMMASCOPE-2-RES-16K
INDEX 0
Mentions of database table operations—especially self-joins and queries combining or comparing table rows.
gpt-5-mini
here is to join the table against itself. Pret
LLAMA3.1-8B-IT
27-RESID-POST-AA
INDEX 109618
The neuron detects first-person self-reference (speaker-focused pronouns and constructions indicating "I"/the narrator).
gpt-5-mini
king of the game↵I've got my recipes,
LLAMA3.1-8B-IT
27-RESID-POST-AA
INDEX 61078
tokens that appear in assistant self-descriptions (mentions of "AI"/"language model", training/knowledge cutoff, updates, system time and related limitations).
gpt-5-mini
assistant<|end_header_id|>↵↵As an AI language model, my knowledge
LLAMA3.1-8B-IT
7-RESID-POST-AA
INDEX 85651
This neuron detects first-person self-referential words and role/identity declarations (tokens like "I", "I'm", "am" and similar self-identifying phrases).
gpt-5-mini
their race or ethnicity. I am not a real entity
LLAMA3.1-8B-IT
11-RESID-POST-AA
INDEX 75732