© Neuronpedia 2026

    Neuronpedia

    Gemma-3-12B · 24-GEMMASCOPE-2-RES-16K · 12461
    Explanations

    The explanation should be a concise phrase (3-20 words) describing what the neuron detects or predicts by finding patterns in the lists. I need to look for common themes or structures.

    Let's analyze the input:

    * **MAX_ACTIVATING_TOKENS**: `some`, `rating`
    * **TOKENS_AFTER_MAX_ACTIVATING_TOKEN**: `key` (after `some`), `difference` (after `rating`)
    * **TOP_POSITIVE_LOGITS**: `Solved`, `SNL`, `ваше`, `法国`, `ገኙ`, `ック`, `गिरि`, `呷`, `फ़`, `âtel`. This list is very diverse and might indicate the neuron responds to a broad range of inputs or specific technical terms, but it's not immediately clear how it relates to the other lists.
    * **TOP_ACTIVATING_TEXTS**:
        * Mathematical calculations (`200 x 9`, `23^2`, `312 x 3`, `12 x 17`, `88 x 88`, `366 x 10`).
        * Discussion about elections (`US presidential election`, `major candidates`, `key issues`).
        * Explanation of "rating difference" and "expected score".

    **Finding the Pattern:**

    1. **`some` + `key`**: The text snippet "some key issues" connects `some` from `MAX_ACTIVATING_TOKENS` with `key` from `TOKENS_AFTER_MAX_ACTIVATING_TOKEN`. This suggests a focus on "key issues" or important points.
    2. **`rating` + `difference`**: The text snippet explaining "expected score depends *only* on the rating difference" connects `rating` with `difference`. This is a clear semantic link related to scoring or valuation based on differences.
    3. **Mathematical Context**: The math examples strongly suggest the neuron is involved in breaking down problems, calculations, or logical steps.
    4. **Election Context**: The election snippet suggests it might identify specific topics or categories within a discourse.

    **Synthesizing the Pattern:**

    The neuron seems to be activated by specific entities (`rating`, `some`) followed by terms that often elaborate on them (`difference`, `key`). The examples lean towards analytical tasks: breaking down calculations or understanding the significance of differences in ratings or key issues.

    The term "key issues" points towards identifying significant topics. The "rating difference" points towards understanding comparative values or scores. Many of the `TOP_POSITIVE_LOGITS` are not in English, which is a characteristic of models that have been trained on multilingual data, or potentially that the neuron specializes in specific non-English contexts or technical terms, but this is less about *pattern* and more about *scope*.

    The most consistent pattern linking `MAX_ACTIVATING_TOKENS`, `TOKENS_AFTER_MAX_ACTIVATING_TOKEN`, and `TOP_ACTIVATING_TEXTS` involves identifying significant items or points. "Key issues" is very direct. "Rating difference" implies a focus on the *difference* between ratings, which is a specific characteristic.

    Looking at `TOP_POSITIVE_LOGITS` again: `Solved`, `difference` (if it were there, but it's not a logit), `key` (if it were there). The mathematical nature of `Solved` is interesting. The election context is about "key issues". The rating context is about "difference".

    Let's try to combine these:

    - "key issues"
    - "rating difference"
    - Mathematical problem decomposition (suggests identifying crucial components or steps).

    A unifying theme could be identifying crucial components or distinctions.

    - "key issues" -> crucial topics
    - "rating difference" -> crucial distinction

    What about the numbers and math examples? They are often about *steps* in a calculation or *breaking down* a problem. This also points to identifying components.

    Consider the combination of `some` `key` and `rating` `difference`. The neuron might be looking for specific types of entities and their attributes or differences.

    Let's re-examine `TOP_POSITIVE_LOGITS`. If we ignore language and focus on conceptual similarity:

    - `Solved` (a successful outcome after steps)
    - `difference` (conceptually related to comparison, analysis)
    - `key` (conceptually related to importance, central points)

    The texts show:

    - Breaking down math problems (`(20 + 3) * (20 + 3)`, `(300 x 3) + (10 x 3) + (2 x 3)`) - identifying components/steps.
    - "key issues" - identifying important topics.
    - "rating difference" - identifying a specific comparative value.

    The pattern seems to be about identifying specific, consequential aspects or components within a context.

    How to phrase this concisely?

    - "important distinctions"
    - "key differences"
    - "components and distinctions"
    - "specific consequential aspects"

    Looking at the prompt's constraints:

    - 3 to 20 words.
    - Find patterns.
    - Specific.
    - Could be about tokens following/preceding.
    - Avoid "tokens", "patterns".
    - Don't start with "This neuron detects".
    - Do not capitalize first letter unless proper noun.

    Let's consider the "key issues" and "rating difference" together. They are both about identifying important aspects or comparisons.

    Could it be about "essential components"?

    - Math decomposition: essential steps/components.
    - Key issues: essential topics.
    - Rating difference: essential comparative value.

    Let's look at the interaction:

    - `some` -> `key`
    - `rating` -> `difference`

    This hints at identifying an entity (`some` thing, or a `rating`) and then its relevant characteristic (`key` attribute, or `difference`).

    What if the pattern is more structural like `word` + `word` where the second word modifies or specifies the first in a particular way?

    - `some` `key` (adjective + noun)
    - `rating` `difference` (noun + noun, where first noun modifies second)

    The most common conceptual theme is identifying something *specific* or *important* or *different*.

    Consider the combination of `key` issues and `rating` `difference`. This points towards identifying critical distinctions or specific notable aspects.

    Let's try to make it very specific. The phrase must capture what the neuron *detects or predicts*.

    If we look at `MAX_ACTIVATING_TOKENS` as `some` and `rating`, and `TOKENS_AFTER_MAX_ACTIVATING_TOKEN` as `key` and `difference`. The neuron is looking for phrases like "key [noun]" or "[noun] difference". The `TOP_ACTIVATING_TEXTS` provide examples of this: "some key issues", "rating difference". The math examples also show breaking down into components, which is a form of identifying distinguishing parts.

    So, the neuron seems to be specifically sensitive to identifying key concepts and distinctions.

    Let's refine:

    "key components and differences" - 3 words.

    This captures "key issues" and "rating difference" and the breakdown of math problems (components).

    Let's check the `TOP_POSITIVE_LOGITS` list. `Solved` is a result. `difference` and `key` are not directly there but are strongly implied by the other lists. The foreign words are problematic for a simple conceptual explanation.

    However, the rule is to find patterns in lists and explain what it detects/predicts. The most salient *predictive pattern* is the structure and

    np_acts-logits-general · gemini-2.5-flash-lite

    The neuron strongly activates on tokens appearing in dense, data‐heavy or enumerated passages—e.g. specifications, lists, citations, price or measurement entries, and similar structured or tabulated text.

    oai_token-act-pair · o4-mini · Triggered by @jyhe0408
    Configuration
    google/gemma-scope-2-12b-pt/resid_post/layer_24_width_16k_l0_medium
    Prompts (Dashboard): 392,802 prompts, 256 tokens each
    Dataset (Dashboard): monology/pile-uncopyrighted

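    Negative Logits

    | Token | Value |
    |---|---|
    | `sama` | 0.96 |
    | `schaft` | 0.91 |
    | `ska` | 0.85 |
    | `;` | 0.84 |
    | ` ruch` | 0.83 |
    | `ses` | 0.82 |
    | `!` | 0.80 |
    | ` slander` | 0.79 |
    | `}` | 0.77 |
    | ` byla` | 0.77 |

    Positive Logits

    | Token | Value |
    |---|---|
    | `Solved` | 0.77 |
    | `的情况下` | 0.76 |
    | ` задержа` | 0.76 |
    | ` दलित` | 0.75 |
    | `片段` | 0.74 |
    | ` गिरि` | 0.74 |
    | ` फ़` | 0.74 |
    | `вате` | 0.73 |
    | `SNL` | 0.73 |
    | ` रिस` | 0.73 |
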
    Activations Density 0.000%

    No Known Activations