© Neuronpedia 2026
    Gemma-3-1B · 22-GEMMASCOPE-2-RES-16K · 9701
    Explanations

    word counts like 100-150 words, 200-250 words, 500 words, 7500-9000 words, 700-800 words. The phrase "Approx. X-Y words" or "(Approx. X words)" or "X-Y words" appears frequently in the text snippets. The `MAX_ACTIVATING_TOKENS` are numerical, and the `TOKENS_AFTER_MAX_ACTIVATING_TOKEN` are all `0`, suggesting that these numbers within the text could be what triggers the neuron. The `TOP_POSITIVE_LOGITS` list contains words from various languages, which don't seem to connect directly to the text snippets, but the text snippets show a strong pattern.

    Therefore, a possible explanation is that the neuron detects mentions of word counts or approximate word counts.

    Let's refine:
    - "Approximate word counts" - 3 words.
    - "word count approximations" - 3 words.
    - "mentions of word counts" - 4 words.
    - "specifying word counts" - 3 words.
    - "approximate word count specifications" - 4 words.

    The `TOP_POSITIVE_LOGITS` could be a distraction or represent a secondary, less prominent behavior. The primary, most evident pattern is in the `TOP_ACTIVATING_TEXTS`.

    Considering the rules:
    - Concise (3-20 words): Yes.
    - Single word or phrase, not a full sentence: Yes.
    - About tokens following or preceding (implicitly here, the numbers are tokens representing counts): Yes.
    - Avoid listing all tokens: Yes.
    - No phrases like "words related to": Yes.
    - No "This neuron detects/predicts": Yes.
    - No "tokens" or "patterns": Yes.
    - Don't capitalize unless proper noun: Yes.
    - Specific: Yes, it's about word counts.

    The `MAX_ACTIVATING_TOKENS` values (5, 5, 0, 5, 8, 2, 8, 8, 8, 4) could correspond to tokens that represent numbers or parts of numbers seen in these word-count phrases. For example, "100-150 words" might be split into several tokens, and these digits could be the high-activation ones. However, without the actual tokenization, it's hard to say for sure. The `0` after each `MAX_ACTIVATING_TOKEN` might signify the end of a phrase or sentence fragment. The most consistent and striking pattern is the explicit mention of word counts.

    Let's try to be as direct as possible, focusing on the content it's reacting to. "approximate word counts" is a good candidate. What if the numbers themselves relate to the *length* of the text being described?
    - "100-150 words"
    - "200-250 words"
    - "500 words"
    - "7500-9000 words"
    - "700-800 words"
    - "300-400 words"

    The neuron seems to activate when text specifies approximate word counts for sections, documents, or options. Final choice: "approximate word counts" or "word count specifications"; "approximate word counts" is more descriptive of what's seen.

    Let's check the `TOP_POSITIVE_LOGITS` again: `अंसारी`, `ல்லை`, `පළ`, `আত্ম`, `இயற்கை`, `অতি`, `Neder`, `महंत`, `වි`, `ateľ`. These are mainly words in non-Latin scripts. This is very odd. Do these languages sometimes use numbers in a specific way? Or perhaps the neuron is *multi-talented*, detecting both foreign words *and* structural text elements like word counts? If I have to pick *one* behavior, the word counts are much more prominent and consistently represented in the `TOP_ACTIVATING_TEXTS`. The foreign words appear only in `TOP_POSITIVE_LOGITS`.

    Let's re-evaluate the prompt: "Your response should be a concise explanation (3 to 20 words) that captures what the neuron detects or predicts by finding patterns in lists." The pattern of explicit word-count specifications is very strong in the `TOP_ACTIVATING_TEXTS`. The `MAX_ACTIVATING_TOKENS` are numbers. The `TOKENS_AFTER_MAX_ACTIVATING_TOKEN` being `0` implies that these numbers are often terminal or followed by very little. Could the neuron be detecting numbers followed by "words"?

    Let's combine the lists:
    - `MAX_ACTIVATING_TOKENS`: 5, 5, 0, 5, 8, 2, 8, 8, 8, 4
    - `TOKENS_AFTER_MAX_ACTIVATING_TOKEN`: all 0
    - `TOP_ACTIVATING_TEXTS`: contain phrases like "Approx. 100-150 words", "Approx. 200-250 words", "7500-9000 words", "Approx. 500 words"
    - `TOP_POSITIVE_LOGITS`: are foreign-language words

    It's possible the neuron detects *numbers that appear in a context of word-count specifications*. Given the strong presence of word counts in `TOP_ACTIVATING_TEXT
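The word-count hypothesis in this explanation can be made concrete with a text-side heuristic. The sketch below is a hypothetical regular expression (`WORD_COUNT_SPEC` and `find_word_count_specs` are illustrative names, not part of Neuronpedia or the explainer) that flags phrases such as "Approx. 100-150 words" or "500 words". The sample strings are made up in the style of the quoted snippets, not taken from the dataset, and the heuristic only approximates the proposed behavior; it says nothing about how the SAE feature itself computes its activation.

```python
import re

# Hypothetical heuristic for the proposed behavior: an optional "Approx."
# prefix, a number or a number range, then the word "words".
WORD_COUNT_SPEC = re.compile(
    r"(?P<approx>Approx\.\s*)?"   # optional "Approx." prefix
    r"(?P<low>\d+)"               # lower (or only) bound
    r"(?:\s*-\s*(?P<high>\d+))?"  # optional upper bound of a range
    r"\s+words\b",
    re.IGNORECASE,
)

def find_word_count_specs(text):
    """Return (low, high, is_approximate) for each word-count phrase in text."""
    specs = []
    for match in WORD_COUNT_SPEC.finditer(text):
        low = int(match.group("low"))
        high = int(match.group("high")) if match.group("high") else low
        specs.append((low, high, match.group("approx") is not None))
    return specs

# Made-up strings in the style of the quoted snippets (not dataset text).
samples = ["Section 1 (Approx. 100-150 words)", "Approx. 500 words", "Report: 7500-9000 words"]
for sample in samples:
    print(sample, "->", find_word_count_specs(sample))
```

A single number is returned as a degenerate range (`low == high`), so "500 words" and "100-150 words" can be handled by the same downstream code.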

    np_acts-logits-general · gemini-2.5-flash-lite
    Configuration
    google/gemma-scope-2-1b-pt/resid_post/layer_22_width_16k_l0_medium
    Prompts (Dashboard): 392,802 prompts, 256 tokens each
    Dataset (Dashboard): monology/pile-uncopyrighted
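The SAE identifier in the configuration packs several settings into one path. Assuming the layout `<org>/<repo>/<hook>/layer_<n>_width_<w>_l0_<label>` (inferred from this single example, not from any published naming specification), a small parser can split it into fields. `SaeId` and `parse_sae_path` are hypothetical helpers written for illustration.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class SaeId:
    repo: str    # e.g. "google/gemma-scope-2-1b-pt"
    hook: str    # e.g. "resid_post"
    layer: int   # e.g. 22
    width: str   # dictionary-size label, e.g. "16k"
    l0: str      # L0 (sparsity) setting label, e.g. "medium"

# Assumed layout, inferred from one example: layer_<n>_width_<w>_l0_<label>
_SAE_NAME = re.compile(r"layer_(\d+)_width_(\w+?)_l0_(\w+)")

def parse_sae_path(path: str) -> SaeId:
    """Split an SAE path of the assumed form <org>/<repo>/<hook>/<name>."""
    org, repo, hook, name = path.split("/")
    match = _SAE_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized SAE name: {name!r}")
    layer, width, l0 = match.groups()
    return SaeId(f"{org}/{repo}", hook, int(layer), width, l0)

sae = parse_sae_path("google/gemma-scope-2-1b-pt/resid_post/layer_22_width_16k_l0_medium")
print(sae)
```

Read this way, layer 22, the residual-stream hook, and the 16k width appear to line up with the `22-GEMMASCOPE-2-RES-16K` identifier on this page.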

    Negative Logits
    `sdag`: 1.02
    `られない`: 0.98
    ` Vanderbilt`: 0.96
    ` đừng`: 0.94
    `なくても`: 0.91
    `せない`: 0.90
    ` freshman`: 0.90
    `懶`: 0.90
    ` Брита`: 0.89
    ` gravitational`: 0.89
    Positive Logits
    `strong`: 1.08
    ` jiwa`: 1.06
    ` шт`: 1.06
    `ান্ড`: 1.06
    ` души`: 1.02
    ` adet`: 1.02
    ` unità`: 1.02
    ` strong`: 0.99
    `entries`: 0.99
    ` fő`: 0.98
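Logit lists like these are commonly produced by projecting a feature's decoder direction through the model's unembedding matrix and ranking vocabulary entries by the result. The sketch below shows that projection on a toy 2 x 4 matrix with made-up token names; `logit_effects` and `top_and_bottom` are hypothetical helpers. The real pipeline may include steps this ignores (for example the final normalization layer, or rescaling of the displayed values), so the toy numbers are not comparable to the ones on this page.

```python
def logit_effects(decoder_row, unembed):
    """Dot a feature's decoder vector with every column of the unembedding matrix.

    decoder_row: d_model floats; unembed: d_model rows of vocab_size floats.
    """
    # zip(*unembed) yields the columns of W_U, one per vocabulary token.
    return [sum(d * w for d, w in zip(decoder_row, column)) for column in zip(*unembed)]

def top_and_bottom(scores, vocab, k):
    """Return the k highest-scoring and k lowest-scoring (token, score) pairs."""
    ranked = sorted(zip(vocab, scores), key=lambda pair: pair[1])
    return ranked[::-1][:k], ranked[:k]

# Toy data: d_model = 2, vocabulary of 4 made-up tokens.
decoder_row = [1.0, 0.0]
unembed = [
    [2.0, -1.0, 0.5, 0.0],
    [0.0, 3.0, 1.0, -2.0],
]
vocab = ["alpha", "beta", "gamma", "delta"]

scores = logit_effects(decoder_row, unembed)
positive, negative = top_and_bottom(scores, vocab, k=2)
print("positive:", positive)  # [('alpha', 2.0), ('gamma', 0.5)]
print("negative:", negative)  # [('beta', -1.0), ('delta', 0.0)]
```

Because the score is a plain dot product, a token can rank highly here even if it never appears near the feature's activating text, which is one way the explanation's mismatch between logits and top activating texts can arise.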
    Activations Density 0.057%
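Activation density is usually read as the fraction of tokens on which the feature is nonzero. Assuming, only for illustration, that the 0.057% figure applies to the full dashboard prompt set (the page does not say which sample it was measured on), the arithmetic gives a rough count of active tokens:

```python
# Assumption for illustration: the density was measured over every token
# of the dashboard prompt set (392,802 prompts x 256 tokens).
prompts = 392_802
tokens_per_prompt = 256
density = 0.057 / 100  # 0.057% as a fraction

total_tokens = prompts * tokens_per_prompt
expected_active = total_tokens * density
print(f"{total_tokens:,} tokens -> about {expected_active:,.0f} active")
# prints: 100,557,312 tokens -> about 57,318 active
```

Under that assumption the feature would fire on roughly 57 thousand of about 100.6 million tokens.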

    No Known Activations