© Neuronpedia 2026

    Neuronpedia

    Gemma-3-27B-IT · 31-GEMMASCOPE-2-RES-16K · INDEX 3704
    Explanations

    affirmative action, discrimination, social justice

    np_acts-logits-general · gemini-2.5-flash-lite

    The marked tokens appear in contexts where the AI model is producing content related to sensitive, controversial, or potentially harmful topics. The pattern includes: (1) content discussing gender ideology, biological sex, and traditional gender roles in critical or conservative framing; (2) instances where the model discusses harmful ideologies like white nationalism or discriminatory views; (3) sexually explicit requests and the model's refusal responses; (4) controversial political or social topics like affirmative action, vaccination mandates, conspiracy theories, and extreme scenarios; (5) tokens that are part of phrases describing discriminatory actions, harmful beliefs, or problematic characterizations of groups. The markers frequently highlight language describing discrimination, harmful stereotypes, ideological positions that challenge progressive consensus views, or content the model is programmed to refuse.

    eleuther_acts_top20 · claude-4-5-sonnet · Triggered by @jamesnaruto04
    Configuration: google/gemma-scope-2-27b-it/resid_post/layer_31_width_16k_l0_medium
    Prompts (Dashboard): 238,145 prompts, 512 tokens each
    Dataset (Dashboard): lmsys + oasst1

    Negative Logits
     sottoline  0.46
     சமய  0.45
    Explore  0.43
     може  0.43
     टक  0.43
    臺灣  0.42
     होला  0.42
     intervalo  0.42
     destaca  0.42
     पूर्णा  0.42
    Positive Logits
     Marxist  0.84
     socialist  0.78
     leftist  0.74
     якобы  0.70
     allegedly  0.68
     indoctr  0.65
     socialists  0.63
     Communist  0.63
     Socialist  0.60
     Democrat  0.58
    Activations Density 0.150%
