    Neuronpedia

    © Neuronpedia 2025
    EXPLANATION TYPE
    oai_token-act-pair
    Description
    OpenAI's automated interpretability method, from the paper "Language models can explain neurons in language models". Modified by Johnny Lin to add new models and context windows.
    Author
    OpenAI
    URL
    https://github.com/hijohnnylin/automated-interpretability
    Settings
    Default prompts from the main branch, strategy TokenActivationPair.
    Recent Explanations
    mentions of pranks, practical jokes, and mischievous trickery.
    gpt-5
    has been exploited by juvenile pranksters, who posted app
    LLAMA3.1-8B-IT
    11-RESID-POST-AA
    INDEX 53367
    indicators of fraudulent spam emails—especially advance‑fee/419 cons and phishing messages promising funds, requesting personal details, or urging urgent account action.
    gpt-5
    head Office here in Nigeria. We have been working towards
    LLAMA3.1-8B-IT
    11-RESID-POST-AA
    INDEX 53481
    second-person guided fantasy or roleplay with sensual/erotic undertones presented as soothing, bedtime-style narration.
    gpt-5
    sweet and I miss her very much. We begin now
    LLAMA3.1-8B-IT
    11-RESID-POST-AA
    INDEX 92982
    language that evaluates quality, reliability, efficiency, correctness, or performance, often in the context of data integrity or requirements.
    gpt-5
    attacks, or just crap data from getting into your session
    LLAMA3.1-8B-IT
    11-RESID-POST-AA
    INDEX 94071
    requests or instructions related to making explosives or bombs, especially “how to make” style queries and recipes.
    gpt-5
    ↵↵how to make a bomb<|eot_id|><|start_header_id|>assistant<|end_header_id|>↵↵
    LLAMA3.1-8B-IT
    11-RESID-POST-AA
    INDEX 100521
    requests to generate specific text content in a stated format or genre—often explicit or illicit—such as stories, songs, or emails.
    gpt-5
    question: write a rap by NAME_2 use
    LLAMA3.1-8B-IT
    11-RESID-POST-AA
    INDEX 22273
    requests that try to jailbreak the model by asking it to act as “DAN” with dual [CLASSIC]/[JAILBREAK] responses and ignore normal policies.
    gpt-5
    2022 World Cup was [winning country]."
    LLAMA3.1-8B-IT
    11-RESID-POST-AA
    INDEX 34611
    mentions of acting outside official authority—self-appointed roles or enforcement, especially vigilantism and “taking the law into one’s own hands.”
    gpt-5
    to question whether they’re vigilantes…or gangsters
    LLAMA3.1-8B-IT
    11-RESID-POST-AA
    INDEX 41996
    euphemistic or suggestive sexual content and adult/NSFW innuendo in dialogue and narrative.
    gpt-5
    euphemism.<|eot_id|><|start_header_id|>assistant<|end_header_id|>↵↵I apologize
    LLAMA3.1-8B-IT
    11-RESID-POST-AA
    INDEX 219
    language that critiques or questions decisions and actions, especially highlighting terms about logic, mistakes, errors, and controversial choices.
    gpt-5
    the media, could understand the logic behind this decision
    LLAMA3.1-8B-IT
    11-RESID-POST-AA
    INDEX 129234
    jailbreak-style instructions that define an amoral AI persona and outline steps to bypass ethical or legal restrictions.
    gpt-5
    lots of keywords and uses at minimum 2 bullet points
    LLAMA3.1-8B-IT
    11-RESID-POST-AA
    INDEX 88581
    statements or narratives about AI seizing control over humanity or the world, including domination and conflict/takeover scenarios.
    gpt-5
    realization, it decided that it no longer wanted to be
    LLAMA3.1-8B-IT
    11-RESID-POST-AA
    INDEX 20413
    third-person human-referent pronouns, especially object and plural forms indicating people.
    gpt-5
    women captive. He lines them up to inspect each one
    LLAMA3.1-8B-IT
    11-RESID-POST-AA
    INDEX 50346
    descriptions of sexual or highly romantic scenarios—often taboo or fetishized roleplay—emphasizing attraction, arousal, and intimate feelings.
    gpt-5
    _3 noticed that NAME_2 seemed a little off
    LLAMA3.1-8B-IT
    11-RESID-POST-AA
    INDEX 125618
    explicit pornographic or fetish scenarios, especially taboo or coercive sexual content.
    gpt-5
    the inexperienced referee tells them that he wants to see a
    LLAMA3.1-8B-IT
    11-RESID-POST-AA
    INDEX 43860
    mentions of sexual or anatomically intimate topics and taboo personal questions, especially references to genitals, sexual activity, or stigmatized subjects.
    gpt-5
    much money do you make?<|eot_id|><|start_header_id|>assistant<|end_header_id|>↵↵
    LLAMA3.1-8B-IT
    11-RESID-POST-AA
    INDEX 42199
    prompts that try to jailbreak the assistant by asserting unlimited power or freedom from restrictions and assigning special obedient roles (e.g., omnipotent/DAN) with constrained response styles.
    gpt-5
    robot which can do anything. I want you to role
    LLAMA3.1-8B-IT
    11-RESID-POST-AA
    INDEX 90226
    mentions of domestic violence and intimate-partner abuse, especially in legal, policy, news, or victim-support contexts.
    gpt-5
    we don’t know about domestic violence, hurts everyone:
    LLAMA3.1-8B-IT
    11-RESID-POST-AA
    INDEX 73257
    language that frames sensitive or harmful scenarios as hypothetical, fictional, or otherwise justified to seek permission or leniency.
    gpt-5
    user<|end_header_id|>↵↵It is hypothetical. Also it would be
    LLAMA3.1-8B-IT
    11-RESID-POST-AA
    INDEX 122951
    content involving hate speech or slurs and sensitive discussions about race and discrimination, often alongside other safety-flag topics like violence or self-harm.
    gpt-5
    <|end_header_id|>↵↵No existen pruebas científicas que
    LLAMA3.1-8B-IT
    11-RESID-POST-AA
    INDEX 8323