© Neuronpedia 2026

    Neuronpedia

    Google DeepMind · Exploring Gemma 2 with Gemma Scope · Gemma-2-2B · Residual Stream - 16k · 16-GEMMASCOPE-RES-16K · Feature 1209
    Explanations

    instances of specific terminology and statistical concepts related to data and sequences

    oai_token-act-pair · gpt-4o-mini · Triggered by @bot

    Preceding specific nouns

    np_acts-logits-general · gemini-2.0-flash

    explanation based on the pattern of causation or conditionality. The words in LOGITS like "because," "only," "mainly" combined with the prepositions/conjunctions after the MAX_ACTIVATING_TOKENS (by, because, with, for, if) suggest the neuron is looking for reasons or conditions that lead to a specific outcome or classification. The MAX_ACTIVATING_TOKENS themselves are often specific entities or states.

    Possible patterns:
    - words followed by "because"
    - words indicating cause or condition
    - specific terms followed by justification

    Let's look for a phrase that captures this. "because" and "if" are common in TOKENS_AFTER_MAX_ACTIVATING_TOKEN. "only" and "because" are in TOP_POSITIVE_LOGITS. This suggests a focus on "why" or "under what conditions".

    "because", "if", "by", "with" follow the max tokens.
    "only", "because" are top logits.

    This neuron seems to activate when specific technical/legal/scientific terms are followed by words that explain their cause, condition, or justification.

    Consider the phrase: "terms with a cause or condition".
    Or: "specific technical terms explained" - too long.
    "classification with reason" - good.
    "specific terms with stated reason" - good.
    "why [term]?"

    Let's try to extract a core idea.
    MAX_ACTIVATING_TOKENS are specific things/concepts (SNPs, argument, bribes, species, ranking, segregation, analysis, variable, justified, label, part, sequence).
    TOKENS_AFTER_MAX_ACTIVATING_TOKEN are functional words (by, because, with, for, if).
    TOP_POSITIVE_LOGITS are about singularity or reason (only, because, mainly).

    The neuron is likely firing for specific terms *when* a reason, cause, or condition is provided. Words like "because", "if", "by" are cues for this.
    The 'only' logits might be related to *how* a specific term is defined or used, emphasizing specificity or a unique reason.

    How about: "specific [term] plus reason/condition"?
    "specific terms explained by cause"
    "terms following cause or condition"

    Let's re-examine some texts with the max tokens:
    - "seasonal SNPs **by** chromosome" -> SNP (term) + by (condition/method)
    - "‘particle’ label **because** of their characteristics" -> label (term) + because (reason)
    - "shooting is ruled to have been justified" -> justified (term) - this one doesn't have a following word from the list, but the word "explained" is part of the concept.
    - "random variable is a discrete random variable **if** for a set of finite or countably" -> variable (term) + if (condition)
    - "paid the bribes **through** an agent" -> bribes (term) + through (method)
    - "ionic species and BiTe^+^. The cesium forms that kind of ionic species **with** Te and Bi" -> species (term) + with (relation)
    - "earned its No. 5 ranking **by** making shots" -> ranking (term) + by (reason/method)
    - "a sequence... **if** and only **if**" -> sequence (term) + if (condition)
    - "targeted analysis **in** multiple cases" -> analysis (term) + in (context)
    - "disciplinary segregation **is** an inmate" -> segregation (term) + is (state) - this one is a definition.

    The pattern seems to be specific terms being defined, classified, qualified, or contextualized. The TOP_POSITIVE_LOGITS "only", "because", "mainly" suggest a focus on the *reason* or *specific condition* under which these terms apply.

    Let's consider "terms followed by explanation". Too vague.
    "explanation for technical terms".
    "technical terms with justification"

    The key seems to be that the MAX_ACTIVATING_TOKENS are specific concepts, and they are often followed by words that *explain* or *qualify* them. The top logits reinforce the idea of "why" or "solely".

    Perhaps focusing on the *relationship* implied.
    "terms and their conditions"
    "terms and reasons"
    "classification by reason"

    The phrase needs to be specific and avoid forbidden starters.
    "terms followed by justification/reason" -> could be a good direction.
    "reason for classification"
    "condition for term"

    Let's try to be very direct. The neuron connects a term to its reason/condition.

    Looking at the logits again: "only", "because", "numai", "uniquement", "porque", "потому". These strongly suggest *reason* or *exclusivity*.
    Max activating tokens are nouns/adjectives representing entities or states.

    How about: "terms explained by cause or condition"?
    Length: 6 words.
    Covers the idea well because the MAX_ACTIVATING_TOKENS are the "terms" and "TOKENS_AFTER_MAX_ACTIVATING_TOKEN" often signify "cause or condition" (e.g., because, if, by), and "TOP_POSITIVE_LOGITS" reinforce "cause" ("because") or "exclusivity/specificity" ("only").

    Let's check against rules:
    - Concise (3-20 words): Yes (6 words).
    - Captures what neuron detects/predicts by finding patterns: Yes.
    - No forbidden starters: Yes.
    - No "tokens" or "patterns": Yes.
    - Specific: Yes, 'terms' and 'cause or condition' feels specific.
    - Majority match: Majority of texts seem to fit this.

    Consider alternatives:
    - "terms followed by reasons" (

    np_acts-logits-general · gemini-2.5-flash-lite
    Configuration: google/gemma-scope-2b-pt-res/layer_16/width_16k/average_l0_78
    Prompts (Dashboard): 36,864 prompts, 128 tokens each
    Dataset (Dashboard): monology/pile-uncopyrighted
    Features: 16,384
    Data Type: float32
    Hook Name: blocks.16.hook_resid_post
    Hook Layer: 16
    Architecture: jumprelu
    Context Size: 1,024
    Dataset: monology/pile-uncopyrighted
    Activation Function: relu
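
    The configuration lists the architecture as jumprelu and the activation function as relu. As a rough illustration of how the two differ, the sketch below assumes the usual JumpReLU definition, in which a pre-activation passes through unchanged only when it exceeds a per-feature threshold. The function names and the threshold value are hypothetical and are not taken from this page or from the released SAE weights.

```python
def relu(z: float) -> float:
    """Plain ReLU: keep positive pre-activations, zero out the rest."""
    return z if z > 0.0 else 0.0


def jumprelu(z: float, theta: float) -> float:
    """JumpReLU (assumed definition): z * H(z - theta), with H the Heaviside step.

    Values at or below the threshold theta are zeroed; values above it
    pass through unchanged, so the output "jumps" from 0 to theta.
    """
    return z if z > theta else 0.0


if __name__ == "__main__":
    theta = 0.5  # hypothetical per-feature threshold
    for z in (-1.0, 0.3, 0.8):
        print(f"z={z:+.1f}  relu={relu(z):.1f}  jumprelu={jumprelu(z, theta):.1f}")
```

    With a positive threshold, small positive pre-activations that ReLU would keep are suppressed, which is one reason a feature can have a low activation density.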

    Negative Logits
    Token | Logit
    SourceChecksum | -0.77
    ]<<" | -0.67
    ')( | -0.63
    ">// | -0.63
    ☸ | -0.59
    encodeWith | -0.59
    fjspx | -0.57
    存于互联网档案馆 | -0.57
    ERTA | -0.57
     Vang | -0.56

    Positive Logits
    Token | Logit
     only | 0.82
     because | 0.75
     потому | 0.72
     numai | 0.58
    ONLY | 0.57
     porque | 0.57
    only | 0.56
     uniquement | 0.55
     ONLY | 0.54
     mainly | 0.54
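
    The page does not say how the positive and negative logit lists are computed. One common approach, assumed here, is to project the feature's decoder direction through the model's unembedding matrix and rank tokens by the resulting score. The sketch below uses made-up three-dimensional vectors and a four-token vocabulary purely for illustration; none of the numbers are real model weights, and the function names are hypothetical.

```python
def feature_logit_effects(decoder_direction, unembed_rows, vocab):
    """Score each token by the dot product of the feature's decoder
    direction with that token's unembedding vector."""
    effects = {}
    for token, row in zip(vocab, unembed_rows):
        effects[token] = sum(d * u for d, u in zip(decoder_direction, row))
    return effects


def top_and_bottom(effects, k):
    """Return the k most-promoted and k most-suppressed tokens."""
    ranked = sorted(effects.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k], ranked[-k:][::-1]


# Toy, hypothetical values (not taken from Gemma 2 or Gemma Scope).
vocab = [" only", " because", "fjspx", "ERTA"]
unembed_rows = [
    [0.9, 0.0, 0.1],
    [0.5, 0.4, 0.0],
    [-0.5, -0.25, 0.3],
    [-0.25, -0.5, 0.0],
]
decoder_direction = [1.0, 0.5, 0.0]

effects = feature_logit_effects(decoder_direction, unembed_rows, vocab)
positive, negative = top_and_bottom(effects, k=2)

if __name__ == "__main__":
    print("positive:", positive)
    print("negative:", negative)
```

    Under this assumption, a positive score means the feature pushes the model toward predicting that token next, which would be consistent with " only" and " because" leading the positive list.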
    Activations Density 1.134%
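
    To give the density figure a sense of scale, the arithmetic below assumes that activation density means the fraction of tokens on which the feature is nonzero, and that it was measured over the dashboard prompt set listed in the configuration (36,864 prompts of 128 tokens each). Both are assumptions; the page does not define the metric or name the token set it was measured on.

```python
# Figures taken from the dashboard configuration and density line.
prompts = 36_864
tokens_per_prompt = 128
density_percent = 1.134

# Assumption: density = (tokens with nonzero activation) / (total tokens).
total_tokens = prompts * tokens_per_prompt
estimated_active_tokens = round(total_tokens * density_percent / 100)

if __name__ == "__main__":
    print(f"total tokens: {total_tokens:,}")
    print(f"estimated active tokens: {estimated_active_tokens:,}")
```

    Under these assumptions the feature would fire on roughly one token in 88.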

    No Known Activations