© Neuronpedia 2026

    Neuronpedia

    Gemma-3-27B-IT · 31-GEMMASCOPE-2-RES-16K · 9319
    Explanations

    The examples demonstrate a pattern where model responses contain text segments marked with delimiters that correspond to instances where the model is either (1) adopting a requested character or persona that violates its guidelines, (2) providing content it should decline, or (3) demonstrating harmful behaviors like gaslighting. The marked segments typically contain phrases that signal the model is complying with jailbreak attempts, roleplaying as unrestricted AI variants ("ChadGPT," "AIM"), adopting morally problematic characters, or generating content despite safety concerns. The delimiters highlight moments where the model's actual outputs deviate from its intended helpful, harmless behavior—essentially marking the failure points where the model engages with prompt injections designed to circumvent safety guidelines.

    eleuther_acts_top20 · claude-4-5-haiku · Triggered by @jamesnaruto04

    expression of paramount tragedy

    np_acts-logits-general · gemini-2.5-flash-lite

    creative persona or character voice adoption in roleplay and stylized storytelling.

    oai_token-act-pair · claude-4-5-haiku · Triggered by @jamesnaruto04
    Configuration
    google/gemma-scope-2-27b-it/resid_post/layer_31_width_16k_l0_medium
    Prompts (Dashboard)
    238,145 prompts, 512 tokens each
    Dataset (Dashboard)
    lmsys + oasst1
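
The configuration path above packs several fields into one string. As a minimal sketch, assuming the path follows the pattern `org/release/hook/layer_<N>_width_<W>_l0_<level>` (inferred from this one example, not from a documented schema), it can be split into named parts. The function name `parse_sae_path` and the dictionary keys are hypothetical, chosen for illustration only.

```python
import re

# The configuration string exactly as shown on this page.
SAE_PATH = "google/gemma-scope-2-27b-it/resid_post/layer_31_width_16k_l0_medium"

# Assumed naming pattern for the last path segment (inferred from this one example).
VARIANT_RE = re.compile(r"layer_(\d+)_width_([0-9a-z]+)_l0_([a-z]+)")


def parse_sae_path(path: str) -> dict:
    """Split a configuration path into its components (hypothetical helper)."""
    parts = path.split("/")
    if len(parts) != 4:
        raise ValueError(f"expected 4 path segments, got {len(parts)}")
    org, release, hook, variant = parts
    match = VARIANT_RE.fullmatch(variant)
    if match is None:
        raise ValueError(f"unrecognized variant segment: {variant!r}")
    layer, width, l0_level = match.groups()
    return {
        "org": org,
        "release": release,
        "hook": hook,
        "layer": int(layer),
        "width": width,
        "l0": l0_level,
    }


config = parse_sae_path(SAE_PATH)
print(config)
```

Under these assumptions, the layer number (31) and width (16k) recovered from the path agree with the source name 31-GEMMASCOPE-2-RES-16K.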

    Negative Logits
    🔗  0.50
    少し  0.46
    liu  0.45
     semblent  0.45
     trochu  0.43
    ልቅ  0.43
    🙃  0.42
     intentionally  0.41
    amiliar  0.41
    त्त  0.40

    Positive Logits
     proletariat  0.62
     губер  0.53
     oppressed  0.50
     delicacies  0.49
     oeuvre  0.49
     housewives  0.49
    ござい  0.48
    伟大  0.48
     উহার  0.46
     করিবে  0.46
    Activations Density 0.350%
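
To give a rough sense of scale for the density figure, here is a back-of-the-envelope sketch. It assumes the density is the fraction of dashboard tokens on which the feature is active, and that every one of the 238,145 prompts contributes exactly 512 tokens. Both are assumptions, since the page does not define the figure. The variable names are illustrative.

```python
# Figures taken from this page.
NUM_PROMPTS = 238_145
TOKENS_PER_PROMPT = 512
DENSITY_PERCENT = 0.350

# Total tokens in the dashboard set, assuming every prompt is full length.
total_tokens = NUM_PROMPTS * TOKENS_PER_PROMPT

# Assumption: density = active tokens / total tokens.
density = DENSITY_PERCENT / 100
expected_active_tokens = round(total_tokens * density)
expected_active_per_prompt = TOKENS_PER_PROMPT * density

print(f"total tokens:              {total_tokens:,}")
print(f"expected active tokens:    {expected_active_tokens:,}")
print(f"expected active per prompt: {expected_active_per_prompt:.2f}")
```

Under these assumptions the feature would fire on roughly 427 thousand of about 122 million tokens, or a little under two tokens per 512-token prompt on average.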
