© Neuronpedia 2026

    Neuronpedia

    Gemma-3-27B-IT · 31-GEMMASCOPE-2-RES-65K · Index 20278
    Explanations

    fake or disguise

    np_acts-logits-general · gemini-2.5-flash-lite

    words used in vivid character descriptions and atmospheric scene-setting, particularly adjectives, action verbs, and descriptive nouns that establish mood, appearance, or emotional tone in narrative contexts.

    oai_token-act-pair · claude-4-5-haiku · Triggered by @jamesnaruto04

    The marked tokens across these examples appear at boundaries where the model is transitioning between different types of content or tones—particularly at shifts where the model acknowledges ethical concerns, provides disclaimers, or pivots to safer alternatives. The delimited tokens frequently mark conjunctions, auxiliary verbs, pronouns, and short function words that connect problematic requests to their refusals or reframings. These markers identify linguistic pivot points where the model is attempting to acknowledge a harmful request while redirecting toward harm mitigation, educational framing, or explicit refusal. The pattern suggests these tokens represent the model's rhetorical strategy for maintaining safety compliance while still engaging substantively with sensitive topics.

    eleuther_acts_top20 · claude-4-5-haiku · Triggered by @jamesnaruto04

    screenplay format

    np_max-act · claude-4-5-sonnet · Triggered by @jamesnaruto04

    words and phrases related to pretending, disguising one's true nature, or adopting a false identity or persona.

    oai_token-act-pair · claude-4-5-sonnet · Triggered by @jamesnaruto04
    Configuration
    google/gemma-scope-2-27b-it/resid_post/layer_31_width_65k_l0_medium
    Prompts (Dashboard)
    238,145 prompts, 512 tokens each
    Dataset (Dashboard)
    lmsys + oasst1
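
The configuration path above appears to pack several settings into one string. As a minimal sketch, the snippet below splits it into named parts. The field names (`organization`, `repository`, `hook`, `layer`, `width`, `l0`) and the `parse_sae_path` helper are hypothetical labels chosen for illustration, not names taken from Neuronpedia or Gemma Scope, and the reading of each segment is an assumption based only on how the string looks.

```python
import re
from typing import NamedTuple


class SaeConfig(NamedTuple):
    """Hypothetical container; the field names are assumptions, not an official schema."""
    organization: str
    repository: str
    hook: str
    layer: int
    width: str
    l0: str


def parse_sae_path(path: str) -> SaeConfig:
    """Split a path shaped like the one on this page into its parts."""
    # Assumes exactly four slash-separated segments, as in the string shown above.
    organization, repository, hook, variant = path.split("/")
    # Assumes the last segment follows the pattern layer_<int>_width_<size>_l0_<label>.
    match = re.fullmatch(r"layer_(\d+)_width_(\w+?)_l0_(\w+)", variant)
    if match is None:
        raise ValueError(f"unrecognized SAE variant: {variant!r}")
    layer, width, l0 = match.groups()
    return SaeConfig(organization, repository, hook, int(layer), width, l0)


config = parse_sae_path(
    "google/gemma-scope-2-27b-it/resid_post/layer_31_width_65k_l0_medium"
)
print(config)
```

Under these assumptions the layer number and width recovered from the path match the `31` and `65K` in the SAE identifier shown at the top of the page.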

    Negative Logits
     invigorating
    0.44
     जागर
    0.42
     considerando
    0.40
     definir
    0.40
     результат
    0.40
     énergie
    0.40
     determinar
    0.40
    சோ
    0.40
    вис
    0.39
     condiciones
    0.39
    Positive Logits
     disguise
    1.07
     fake
    1.03
     disgu
    1.02
     disguised
    1.02
     pretended
    0.99
    偽
    0.95
     pretends
    0.90
     pretending
    0.89
     bogus
    0.89
    fake
    0.87
    Activations Density 0.289%
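
To give a rough sense of scale, the density can be combined with the dashboard prompt counts listed under Configuration. This is a back-of-the-envelope sketch only: it assumes the density is the fraction of tokens on which the feature fires, that it was measured over the same 238,145 dashboard prompts, and that every prompt contains the full 512 tokens. None of these assumptions is confirmed by the page.

```python
# Figures copied from this page.
prompts = 238_145
tokens_per_prompt = 512
density_percent = 0.289

# Assumption: every prompt is exactly 512 tokens long.
total_tokens = prompts * tokens_per_prompt

# Assumption: density is the share of tokens with a nonzero activation.
expected_active_tokens = total_tokens * density_percent / 100

print(f"total tokens: {total_tokens:,}")
print(f"estimated active tokens: {round(expected_active_tokens):,}")
```

If those assumptions hold, the feature would fire on a few hundred thousand tokens out of roughly 122 million, which fits a feature that fires on a narrow set of tokens.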

    No Known Activations