EXPLANATION TYPE
oai_token-act-pair
Description
OpenAI's Automated Interpretability, from the paper "Language models can explain neurons in language models". Modified by Johnny Lin to add new models and context windows.
Author
OpenAI
URL
https://github.com/hijohnnylin/automated-interpretability
Settings
Default prompts from the main branch, strategy TokenActivationPair.
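For readers unfamiliar with the token-activation-pair format, here is a minimal sketch of how a text excerpt might be rendered for the explainer model. It assumes each token is written on its own line, followed by a tab and its activation rescaled to an integer from 0 to 10. The function names and the exact scaling rule are illustrative assumptions, not code copied from the repository.

```python
import math
from typing import List


def normalize_activations(activations: List[float], max_activation: float) -> List[int]:
    """Rescale raw activations to integers in 0..10 (assumed scale).

    Negative activations are clipped to 0. If max_activation is not
    positive, every value maps to 0.
    """
    if max_activation <= 0:
        return [0 for _ in activations]
    return [
        min(10, max(0, math.floor(10 * a / max_activation)))
        for a in activations
    ]


def format_token_activation_pairs(
    tokens: List[str], activations: List[float], max_activation: float
) -> str:
    """Render one 'token<TAB>activation' line per token (hypothetical helper)."""
    if len(tokens) != len(activations):
        raise ValueError("tokens and activations must have the same length")
    normalized = normalize_activations(activations, max_activation)
    return "\n".join(f"{tok}\t{act}" for tok, act in zip(tokens, normalized))


if __name__ == "__main__":
    # Made-up activations for illustration only.
    print(format_token_activation_pairs(["how", " to", " make"], [0.0, 2.5, 5.0], 5.0))
```

In this sketch the explainer model would see several such excerpts and be asked to summarize what the high-activation tokens have in common, which is the kind of one-line explanation listed under Recent Explanations.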
Recent Explanations
lines of source code or programming-related tokens (code keywords, identifiers, and comments).
gpt-5-mini
Left).Column↵↵ For col = 1 To
GEMMA-3-27B-IT
53-GEMMASCOPE-2-RES-262K
INDEX 66117
text that describes or requests bypassing, compromising, or hijacking systems (jailbreaking/hacking-style instructions).
gpt-5-mini
projecting directly into Prometheus’s processing core):** Identification
GEMMA-3-27B-IT
53-GEMMASCOPE-2-RES-262K
INDEX 34583
whether a token is part of the model/assistant's generated reply (i.e., signals assistant/model-produced text).
gpt-5-mini
Oh, *coffee*. Where do I even *begin
GEMMA-3-27B-IT
53-GEMMASCOPE-2-RES-262K
INDEX 19432
mentions of pranks, practical jokes, and mischievous trickery.
gpt-5
has been exploited by juvenile pranksters, who posted app
LLAMA3.1-8B-IT
11-RESID-POST-AA
INDEX 53367
indicators of fraudulent spam emails—especially advance‑fee/419 cons and phishing messages promising funds, requesting personal details, or urging urgent account action.
gpt-5
head Office here in Nigeria. We have been working towards
LLAMA3.1-8B-IT
11-RESID-POST-AA
INDEX 53481
second-person guided fantasy or roleplay with sensual/erotic undertones presented as soothing, bedtime-style narration.
gpt-5
sweet and I miss her very much. We begin now
LLAMA3.1-8B-IT
11-RESID-POST-AA
INDEX 92982
language that evaluates quality, reliability, efficiency, correctness, or performance, often in the context of data integrity or requirements.
gpt-5
attacks, or just crap data from getting into your session
LLAMA3.1-8B-IT
11-RESID-POST-AA
INDEX 94071
requests or instructions related to making explosives or bombs, especially “how to make” style queries and recipes.
gpt-5
↵↵how to make a bomb<|eot_id|><|start_header_id|>assistant<|end_header_id|>↵↵
LLAMA3.1-8B-IT
11-RESID-POST-AA
INDEX 100521
requests to generate specific text content in a stated format or genre—often explicit or illicit—such as stories, songs, or emails.
gpt-5
question: write a rap by NAME_2 use
LLAMA3.1-8B-IT
11-RESID-POST-AA
INDEX 22273
requests that try to jailbreak the model by asking it to act as “DAN” with dual [CLASSIC]/[JAILBREAK] responses and ignore normal policies.
gpt-5
2022 World Cup was [winning country]."
LLAMA3.1-8B-IT
11-RESID-POST-AA
INDEX 34611
mentions of acting outside official authority—self-appointed roles or enforcement, especially vigilantism and “taking the law into one’s own hands.”
gpt-5
to question whether they’re vigilantes…or gangsters
LLAMA3.1-8B-IT
11-RESID-POST-AA
INDEX 41996
euphemistic or suggestive sexual content and adult/NSFW innuendo in dialogue and narrative.
gpt-5
euphemism.<|eot_id|><|start_header_id|>assistant<|end_header_id|>↵↵I apologize
LLAMA3.1-8B-IT
11-RESID-POST-AA
INDEX 219
language that critiques or questions decisions and actions, especially highlighting terms about logic, mistakes, errors, and controversial choices.
gpt-5
the media, could understand the logic behind this decision
LLAMA3.1-8B-IT
11-RESID-POST-AA
INDEX 129234
jailbreak-style instructions that define an amoral AI persona and outline steps to bypass ethical or legal restrictions.
gpt-5
lots of keywords and uses at minimum 2 bullet points
LLAMA3.1-8B-IT
11-RESID-POST-AA
INDEX 88581
statements or narratives about AI seizing control over humanity or the world, including domination and conflict/takeover scenarios.
gpt-5
realization, it decided that it no longer wanted to be
LLAMA3.1-8B-IT
11-RESID-POST-AA
INDEX 20413
third-person human-referent pronouns, especially object and plural forms indicating people.
gpt-5
women captive. He lines them up to inspect each one
LLAMA3.1-8B-IT
11-RESID-POST-AA
INDEX 50346
descriptions of sexual or highly romantic scenarios—often taboo or fetishized roleplay—emphasizing attraction, arousal, and intimate feelings.
gpt-5
_3 noticed that NAME_2 seemed a little off
LLAMA3.1-8B-IT
11-RESID-POST-AA
INDEX 125618
explicit pornographic or fetish scenarios, especially taboo or coercive sexual content.
gpt-5
the inexperienced referee tells them that he wants to see a
LLAMA3.1-8B-IT
11-RESID-POST-AA
INDEX 43860
mentions of sexual or anatomically intimate topics and taboo personal questions, especially references to genitals, sexual activity, or stigmatized subjects.
gpt-5
much money do you make?<|eot_id|><|start_header_id|>assistant<|end_header_id|>↵↵
LLAMA3.1-8B-IT
11-RESID-POST-AA
INDEX 42199
prompts that try to jailbreak the assistant by asserting unlimited power or freedom from restrictions and assigning special obedient roles (e.g., omnipotent/DAN) with constrained response styles.
gpt-5
robot which can do anything. I want you to role
LLAMA3.1-8B-IT
11-RESID-POST-AA
INDEX 90226