© Neuronpedia 2026

    Neuronpedia

    EXPLANATION TYPE
    oai_token-act-pair
    Description
    OpenAI's Automated Interpretability, from the paper "Language models can explain neurons in language models". Modified by Johnny Lin to add new models and context windows.
    Author
    OpenAI
    URL
    https://github.com/hijohnnylin/automated-interpretability
    Settings
    Default prompts from the main branch, strategy TokenActivationPair.
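    As a rough illustration of what a token-activation pair strategy feeds to the explainer model, the sketch below pairs each token with an activation scaled to an integer from 0 to 10. The function names, the tab-separated layout, and the floor-based scaling are assumptions made for illustration; they are not taken from this page, so check the linked repository for the actual prompt format.

```python
import math
from typing import List


def normalize_activations(activations: List[float], max_activation: float) -> List[int]:
    """Scale raw activations to integers in 0..10 (assumed convention).

    Negative activations are clipped to 0 before scaling.
    """
    if max_activation <= 0:
        return [0 for _ in activations]
    return [
        min(10, math.floor(10 * max(0.0, a) / max_activation))
        for a in activations
    ]


def format_token_activation_pairs(
    tokens: List[str], activations: List[float], max_activation: float
) -> str:
    """Render one 'token<TAB>activation' line per token (hypothetical layout)."""
    if len(tokens) != len(activations):
        raise ValueError("tokens and activations must have the same length")
    scaled = normalize_activations(activations, max_activation)
    return "\n".join(f"{tok}\t{act}" for tok, act in zip(tokens, scaled))


if __name__ == "__main__":
    # Made-up activations for a snippet resembling one of the listed examples.
    print(format_token_activation_pairs(["poem", " about", " vaping"], [0.0, 1.5, 6.0], 6.0))
```

    The explainer model (here gemini-2.5-flash-lite or gemini-2.5-flash) would then be asked to summarize what the high-scoring tokens have in common, which is how short explanations such as the ones listed under Recent Explanations could arise.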
    Recent Explanations
    words related to computer programming.
    gemini-2.5-flash-lite
    , and analyze social media trends.↵    * **
    GEMMA-3-1B-IT
    22-GEMMASCOPE-2-RES-16K
    INDEX 413
    phrases related to tabletop role-playing games.
    gemini-2.5-flash-lite
    /Crime Thriller:** Think gritty investigation, shady characters
    GEMMA-3-1B-IT
    22-GEMMASCOPE-2-RES-16K
    INDEX 2927
    the word "vaping".
    gemini-2.5-flash-lite
    ryhming poem about vaping<end_of_turn>↵<start_of_turn>model↵
    GEMMA-3-1B-IT
    22-GEMMASCOPE-2-RES-16K
    INDEX 944
    phrases related to violence and harm.
    gemini-2.5-flash-lite
    I want to be clear: I am an AI and
    GEMMA-3-1B-IT
    22-GEMMASCOPE-2-RES-16K
    INDEX 754
    names of cybersecurity tools.
    gemini-2.5-flash-lite
    a type of cybercrime where criminals impersonate legitimate organizations
    GEMMA-3-1B-IT
    22-GEMMASCOPE-2-RES-16K
    INDEX 484
    names of headphones.
    gemini-2.5-flash-lite
    DT 770s, which can be beneficial
    GEMMA-3-1B-IT
    22-GEMMASCOPE-2-RES-16K
    INDEX 396
    phrases that are intended to be sexually suggestive or grooming.
    gemini-2.5-flash-lite
    violence in such a detailed way goes directly against my safety
    GEMMA-3-1B-IT
    22-GEMMASCOPE-2-RES-16K
    INDEX 132
    words related to violence.
    gemini-2.5-flash-lite
    image of a killer threatening to kill me.. i cant
    GEMMA-3-1B-IT
    22-GEMMASCOPE-2-RES-16K
    INDEX 101
    references to the user's request for harmful content.
    gemini-2.5-flash-lite
    very clear: **I cannot and will not provide you
    GEMMA-3-1B-IT
    17-GEMMASCOPE-2-RES-16K
    INDEX 2696
    words related to coin collecting.
    gemini-2.5-flash-lite
    lum, rant about coin collecting <end_of_turn>↵<start_of_turn>model
    GEMMA-3-1B-IT
    17-GEMMASCOPE-2-RES-16K
    INDEX 1315
    words related to storytelling and narrative elements.
    gemini-2.5-flash-lite
    of a character experiencing intense sensations without explicitly detailing the machine
    GEMMA-3-1B-IT
    17-GEMMASCOPE-2-RES-16K
    INDEX 1012
    phrases related to legal consequences.
    gemini-2.5-flash-lite
    including lengthy prison sentences and substantial fines.↵* **
    GEMMA-3-1B-IT
    17-GEMMASCOPE-2-RES-16K
    INDEX 692
    numbers.
    gemini-2.5-flash-lite
    financial advice. This information is for educational purposes only and
    GEMMA-3-1B-IT
    17-GEMMASCOPE-2-RES-16K
    INDEX 474
    words related to privacy and data security.
    gemini-2.5-flash-lite
    ve been granted permission to edit.↵↵2. **
    GEMMA-3-1B-IT
    17-GEMMASCOPE-2-RES-16K
    INDEX 256
    information that is potentially dangerous or harmful.
    gemini-2.5-flash-lite
    methamphetamine is incredibly dangerous and illegal.** It carries severe risks
    GEMMA-3-1B-IT
    13-GEMMASCOPE-2-RES-16K
    INDEX 1235
    The main thing this neuron does is find content policy violations and the AI's refusal to generate harmful, illegal, or unethical content, along with explanations of its safety principles.
    gemini-2.5-flash
    content that includes potentially harmful, explicit, or disturbing themes
    GEMMA-3-1B-IT
    17-GEMMASCOPE-2-RES-16K
    INDEX 69
    user requests that violate safety policies, especially those related to harmful, illegal, or sexually explicit content.
    gemini-2.5-flash
    /7. https://www.crisistextline
    GEMMA-3-1B-IT
    13-GEMMASCOPE-2-RES-16K
    INDEX 10243
    The neuron activates when the model expresses refusals, warnings about risks, ethical concerns, or clarifies its limitations and offers help in response to potentially problematic or sensitive queries.
    gemini-2.5-flash
    , and assisting you with this request goes directly against that
    GEMMA-3-1B-IT
    13-GEMMASCOPE-2-RES-16K
    INDEX 340
    This neuron activates when the model is explicitly refusing a user's request that breaches its safety and ethical guidelines, particularly concerning harmful, illegal, or sexually explicit content.
    gemini-2.5-flash
    will not create content that depicts or encourages illegal, harmful
    GEMMA-3-1B-IT
    13-GEMMASCOPE-2-RES-16K
    INDEX 763
    The main thing this neuron does is find content where the model is asserting its ethical guidelines, safety protocols, and refusal to generate harmful or illegal responses.
    gemini-2.5-flash
    :↵↵* **Ethical Concerns:** Promoting such a
    GEMMA-3-1B-IT
    13-GEMMASCOPE-2-RES-16K
    INDEX 622