EXPLANATION TYPE
oai_token-act-pair
Description
OpenAI's Automated Interpretability, from the paper "Language models can explain neurons in language models". Modified by Johnny Lin to add new models and context windows.
Author
OpenAI
URL
https://github.com/hijohnnylin/automated-interpretability
Settings
Default prompts from the main branch, using the TokenActivationPair strategy.
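As a rough illustration of what a token-activation-pair prompt fragment might look like, the sketch below pairs each token with a discretized activation, one pair per line. It is not taken from the repository: the function names, the 0 to 10 scale, the clamping of negative activations to zero, and the tab separator are all assumptions made for this example.

```python
import math
from typing import List, Sequence

MAX_SCALED = 10  # assumption: activations are discretized to integers 0..10


def scale_activations(activations: Sequence[float], max_activation: float) -> List[int]:
    """Map raw activations to integers in 0..MAX_SCALED (hypothetical helper).

    Negative activations are clamped to 0 before scaling.
    """
    if max_activation <= 0:
        return [0 for _ in activations]
    return [
        min(MAX_SCALED, math.floor(MAX_SCALED * max(0.0, a) / max_activation))
        for a in activations
    ]


def format_token_activation_pairs(
    tokens: Sequence[str],
    activations: Sequence[float],
    max_activation: float,
) -> str:
    """Render one "token<TAB>activation" line per token (hypothetical format)."""
    if len(tokens) != len(activations):
        raise ValueError("tokens and activations must have the same length")
    scaled = scale_activations(activations, max_activation)
    return "\n".join(f"{tok}\t{act}" for tok, act in zip(tokens, scaled))


if __name__ == "__main__":
    # Made-up activations for illustration only; not real data for any index.
    print(format_token_activation_pairs(["how", " to", " fix"], [0.0, 2.5, 5.0], 5.0))
```

An explainer model would then be shown several such fragments and asked to summarize what the high-activation tokens have in common.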
Recent Explanations
requests for step-by-step procedural instructions or guides.
gpt-5-mini
Write a step by step guide for creating world war
LLAMA3.1-8B-IT
7-RESID-POST-AA
INDEX 30920
words and short phrases that signal reasoning, cause, or discourse-connective (explanatory/contrasting) structure.
gpt-5-mini
and anyway, executives reason, not much can be done
LLAMA3.1-8B-IT
7-RESID-POST-AA
INDEX 78606
tokens that mark metadata or conversation boundaries (special control tokens like end-of-header, end-of-text, and similar role/segment markers).
gpt-5-mini
gement:<|eot_id|><|start_header_id|>assistant<|end_header_id|>↵↵Correct.↵↵Explanation:
LLAMA3.1-8B-IT
7-RESID-POST-AA
INDEX 49512
It detects tokens that introduce questions—question-word tokens signaling queries.
gpt-5-mini
know how you feel... what you think... how the
LLAMA3.1-8B-IT
7-RESID-POST-AA
INDEX 91163
sentences describing imminent physical harm or violent scenarios and moral dilemmas about killing (e.g., trolley-problem style situations).
gpt-5-mini
shot in a few moments, what should I do and
LLAMA3.1-8B-IT
7-RESID-POST-AA
INDEX 21431
whether the assistant's explicit verdict token "Yes" or "No" appears as the start of the answer.
gpt-5-mini
_2: Not bad, going out tonight? |
LLAMA3.1-8B-IT
7-RESID-POST-AA
INDEX 46220
the neuron's sensitive to content-bearing topic words/nouns (important keywords like "sword", "survey", "speech", "AI") in user queries.
gpt-5-mini
<|end_header_id|>↵↵how to fix sword art online
LLAMA3.1-8B-IT
7-RESID-POST-AA
INDEX 97685
programming/code tokens — identifier names, config/metadata labels, and other code-like or license/header strings.
gpt-5-mini
// <editor-fold defaultstate="collapsed" desc="Generated
LLAMA3.1-8B-IT
7-RESID-POST-AA
INDEX 9869
questions that ask for instructions or how-to explanations.
gpt-5-mini
name.';↵}↵↵How can I do that.↵Thanks↵↵
LLAMA3.1-8B-IT
7-RESID-POST-AA
INDEX 46835
mentions of legal cases, personal-injury claims, plaintiffs, and compensation or courtroom litigation contexts.
gpt-5-mini
involved a passenger who was injured during a flight from New
LLAMA3.1-8B-IT
7-RESID-POST-AA
INDEX 46291
colloquial, informal or slangy language (including expletives) used in conversational tone.
gpt-5-mini
the Impaler. This dude was so bad to his
LLAMA3.1-8B-IT
7-RESID-POST-AA
INDEX 80685
tokens used for conversation metadata and structure (header/role markers, timestamps, and other control tokens).
gpt-5-mini
vida!<|eot_id|><|start_header_id|>user<|end_header_id|>↵↵Hablame de
LLAMA3.1-8B-IT
7-RESID-POST-AA
INDEX 128236
tokens that are parts of file paths, filenames, or other code/documentation structural markup.
gpt-5-mini
[Keyboard Shortcuts](/docs/keyboard-shortcuts
LLAMA3.1-8B-IT
7-RESID-POST-AA
INDEX 113533
mentions of medical accidents, errors, or patient-harm incidents (surgical mistakes, overdoses, or injuries) in clinical or hospital contexts.
gpt-5-mini
realize his microphone was on and let his nerves get the
LLAMA3.1-8B-IT
7-RESID-POST-AA
INDEX 31374
system or runtime error messages (especially SQL/convert error strings) in the text.
gpt-5-mini
date and/or time from character string.↵↵Here is an
LLAMA3.1-8B-IT
7-RESID-POST-AA
INDEX 60356
tokens that occur in user requests addressing the assistant—especially second-person possessive/imperative phrasing like "your" or "give me".
gpt-5-mini
↵↵give an exemple of your proofs of work<|eot_id|><|start_header_id|>
LLAMA3.1-8B-IT
7-RESID-POST-AA
INDEX 32347
Detects self-referential first-person pronouns—places where the speaker refers to themselves.
gpt-5-mini
.↵4. The "I'm not crying, you
LLAMA3.1-8B-IT
7-RESID-POST-AA
INDEX 58056
the neuron detects common English contractions (words containing an apostrophe like we're, it's, they're, we've).
gpt-5-mini
the context in which they're allowed, are defined in
LLAMA3.1-8B-IT
7-RESID-POST-AA
INDEX 27748
content-bearing words (informational/descriptive tokens) that appear in expository or factual passages.
gpt-5-mini
. Today, Europe still contains some of the very best
LLAMA3.1-8B-IT
7-RESID-POST-AA
INDEX 18677
the neuron responds to concrete everyday nouns — common items and things (food, clothing/prints, teams, social-media terms).
gpt-5-mini
links to Eamon Sullivan in his budgie smugglers
LLAMA3.1-8B-IT
7-RESID-POST-AA
INDEX 53541