© Neuronpedia 2026

    Neuronpedia

    Gemma-3-12B · 24-GEMMASCOPE-2-RES-16K · 4529
    Explanations

    I need to find a pattern that connects `MAX_ACTIVATING_TOKENS` (which are all '5' or '0') and `TOKENS_AFTER_MAX_ACTIVATING_TOKEN` (`words`, `lb`, `books`, `distinct`, `diverse`, `0`, `tasks`, `in`, `%`, `sub`). The `TOP_POSITIVE_LOGITS` seem to be non-Latin characters, which might indicate a broader contextual pattern or a specific niche the neuron is sensitive to, but the primary task is to find a pattern in the `MAX_ACTIVATING_TOKENS` and `TOKENS_AFTER_MAX_ACTIVATING_TOKEN`.

    Looking at the `TOP_ACTIVATING_TEXTS`:

    - "list of 25 words" - The token after '5' (implied by `MAX_ACTIVATING_TOKENS`) is 'words'.
    - "25lb in kg" - The token after '5' is 'lb'.
    - "top 15 books" - The token after '5' is 'books'.
    - "list of 25 distinct and diverse questions" - The tokens after '5' are 'distinct' and 'diverse'.
    - "generate 25 diverse sentences" - The token after '5' is 'diverse' (implied).
    - "generate 1500 rows" - The token after '0' is '0'. This doesn't seem to fit the '5' pattern.

    The prompt is asking for a concise explanation (3-20 words) of what the neuron detects or predicts by finding patterns in lists. I need to use the provided lists to infer this pattern.

    1. **MAX_ACTIVATING_TOKENS**: Contains mostly '5', with a few '0's. This suggests the neuron might be related to counts or quantities, often around the number 5 or 0.
    2. **TOKENS_AFTER_MAX_ACTIVATING_TOKEN**: Contains words like `words`, `lb`, `books`, `distinct`, `diverse`, `tasks`, `sub`. These are types of items or descriptions of lists/quantities.
    3. **TOP_ACTIVATING_TEXTS**: Shows examples where numbers (like 25, 15, 1500) are followed by requests for lists, specific items, or quantities of things.
       * "list of 25 words"
       * "25lb in kg"
       * "top 15 books"
       * "list of 25 distinct and diverse questions"
       * "generate 25 diverse sentences"
       * "generate 1500 rows"
       * "output 15 tasks"
       * "list of 15 in depth mechanics"
       * "most important 20% of learnings"
       * "Generate 15 subcategories"

    Combining these observations:

    - The neuron fires when numbers (especially related to 5 or 0 counts, suggesting lists or quantities) are mentioned.
    - The words that follow these numbers are often items being listed or described (words, lb, books, tasks, subcategories, etc.).
    - The `TOP_POSITIVE_LOGITS` seem to be non-Latin characters, which is a bit of an outlier but might indicate a broad sensitivity to distinct linguistic elements or specific formatting. However, the core pattern seems to revolve around quantities and listing.

    The most consistent pattern is the request for lists or specific quantities of items, often preceded by a number. The `TOKENS_AFTER_MAX_ACTIVATING_TOKEN` directly names these items.

    Let's try to formulate a phrase.

    - "requests for lists and quantities" (5 words) - Good, captures the essence.
    - "listing items after numbers" (4 words) - Also good.
    - "numbered lists of items" (4 words) - Captures it well.
    - "quantities and lists of items" (5 words) - Clear and direct.

    The phrase "quantities and lists of items" seems to best summarize the observable pattern from the provided lists. The `MAX_ACTIVATING_TOKENS` being 5 reinforces the idea of specific counts or a certain quantity.

    quantities and lists of items

    np_acts-logits-general · gemini-2.5-flash-lite
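The tallying described in the explanation above can be roughly reproduced with a short script. This is a minimal sketch, not the auto-interp pipeline itself: it uses a word-level regular expression instead of the model's tokenizer, so it reports whole followers such as `subcategories` and `rows` where the explanation lists the subword tokens `sub` and `0`. The names `NUMBER_THEN_NEXT` and `number_and_follower` are hypothetical helpers introduced for illustration.

```python
import re
from collections import Counter

# Example texts quoted in the explanation above.
TOP_ACTIVATING_TEXTS = [
    "list of 25 words",
    "25lb in kg",
    "top 15 books",
    "list of 25 distinct and diverse questions",
    "generate 25 diverse sentences",
    "generate 1500 rows",
    "output 15 tasks",
    "list of 15 in depth mechanics",
    "most important 20% of learnings",
    "Generate 15 subcategories",
]

# Assumption: a number followed by "%" or a word approximates
# "max-activating token, then the token after it".
NUMBER_THEN_NEXT = re.compile(r"(\d+)\s*(%|[A-Za-z]+)")


def number_and_follower(text):
    """Return (number, following word or '%') for the first match, else None."""
    match = NUMBER_THEN_NEXT.search(text)
    return (match.group(1), match.group(2)) if match else None


pairs = [number_and_follower(text) for text in TOP_ACTIVATING_TEXTS]

# Tally the final digit of each number (the explanation notes '5' and '0').
last_digits = Counter(number[-1] for number, _ in pairs)

# Collect what follows each number (the items being counted or measured).
followers = [follower for _, follower in pairs]

print(last_digits)
print(followers)
```

Every example contains a number ending in 5 or 0 followed by a countable item or unit, which is consistent with the "quantities and lists of items" label.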

    The neuron identifies numeric data tokens—especially numbers, percentages, and decimal values.

    oai_token-act-pair · o4-mini · Triggered by @jyhe0408
    Configuration
    google/gemma-scope-2-12b-pt/resid_post/layer_24_width_16k_l0_medium
    Prompts (Dashboard)
    392,802 prompts, 256 tokens each
    Dataset (Dashboard)
    monology/pile-uncopyrighted
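The source path in the configuration appears to encode the hook point, layer, dictionary width, and sparsity setting. The sketch below splits it into those parts; the field names (`org`, `repo`, `hook`, `layer`, `width`, `l0`) and the helper `parse_source_path` are assumptions made for illustration, inferred only from the shape of the string shown above.

```python
import re

# Path copied verbatim from the Configuration section.
SOURCE_PATH = "google/gemma-scope-2-12b-pt/resid_post/layer_24_width_16k_l0_medium"


def parse_source_path(path):
    """Split a source path of the form org/repo/hook/layer_N_width_W_l0_S.

    The interpretation of each segment is an assumption, not a documented schema.
    """
    org, repo, hook, variant = path.split("/")
    match = re.fullmatch(r"layer_(\d+)_width_(\w+?)_l0_(\w+)", variant)
    if match is None:
        raise ValueError(f"unexpected variant segment: {variant!r}")
    return {
        "org": org,
        "repo": repo,
        "hook": hook,
        "layer": int(match.group(1)),
        "width": match.group(2),
        "l0": match.group(3),
    }


info = parse_source_path(SOURCE_PATH)
print(info)
```

Read this way, layer 24, the `resid_post` hook, and width `16k` line up with the source identifier 24-GEMMASCOPE-2-RES-16K.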

    Negative Logits
     wasn  0.77
    didn  0.77
     hee  0.73
     기다  0.73
     irony  0.72
    ellido  0.69
    Рис  0.68
    ীব্র  0.68
     weren  0.67
     imediat  0.67

    Positive Logits
    ו  0.91
    ارم  0.88
    ี  0.87
    ра  0.86
    าร  0.85
    াৰ  0.84
    ISES  0.84
    りで  0.83
    า  0.81
    ล  0.81
    Activations Density 0.001%
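To give the density figure a sense of scale, the arithmetic below combines it with the dashboard prompt counts listed in the configuration. It is a rough estimate under two assumptions: that the density was measured over that same prompt set, and that the displayed 0.001% is a rounded value.

```python
# Figures copied from the page.
PROMPTS = 392_802          # dashboard prompts
TOKENS_PER_PROMPT = 256    # tokens per prompt
DENSITY_PERCENT = 0.001    # activation density as displayed (likely rounded)

# Total tokens in the dashboard prompt set.
total_tokens = PROMPTS * TOKENS_PER_PROMPT

# Expected number of tokens on which the feature is active,
# assuming the density applies to this prompt set.
expected_active = total_tokens * DENSITY_PERCENT / 100

print(f"total tokens: {total_tokens:,}")
print(f"expected active tokens: about {expected_active:,.0f}")
```

Under those assumptions the feature would fire on roughly a thousand tokens out of about 100 million, so it is very sparse.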

    No Known Activations