
Explanations

references to organizational structures and categorization in various contexts

oai_token-act-pair · gpt-4o-mini Triggered by @bot

1. units of time (weeks, decade, Days, monthly)
2. geographical entities (cities, country, state, countries, region)
3. categories (Category)
4. body parts (shoulder)
5. biological entities (animals)

Many of these are followed by punctuation or conjunctions, common in lists.

The explanation should be concise and capture this broad category.

"units, locations, and categories" - 4 words.
"time, places, categories, and body parts" - 6 words.
"units, regions, and time" - 4 words.
"categories, locations, and time units" - 5 words.

Let's re-examine the top logits: scrapy, Administrativna, AttributeSet, bootstrapcdn, Generales, paksa, 全体, ganzes, RotationOrder, myſelf. These are mostly technical or foreign language words. This hints that the neuron might be activating on specific contexts where these words appear, and these contexts happen to contain the types of words in MAX_ACTIVATING_TOKENS.

However, the primary job is to explain what the neuron detects by finding patterns in lists. The lists most directly showing a pattern are MAX_ACTIVATING_TOKENS and TOP_ACTIVATING_TEXTS. The foreign/technical words in TOP_POSITIVE_LOGITS might be where these patterns occur, but the pattern itself is the types of words.

Let's categorize the MAX_ACTIVATING_TOKENS:
- Time units/periods: weeks, decade, monthly, Days
- Geographical locations/divisions: cities, country, state, countries, region
- Biological/organic entities: animals
- Abstract concepts/classifications: Category
- Body parts: shoulder

The most coherent pattern description would encompass these.

"units, locations, categories, and time" seems good.
"locations, time units, and categories"
"geographical areas, time units, and classifications"
"units of time, places, and entities"

The TOP_POSITIVE_LOGITS are quite diverse and don't immediately suggest a single common theme with the other lists. They might represent specific names or technical jargon where these categories appear. So, I should focus on the pattern from the MAX_ACTIVATING_TOKENS and TOP_ACTIVATING_TEXTS.

"units, locations, and entities" is 4 words.
"locations, time units, and classifications" is 5 words.
"places, times, and categories" is 4 words.

Looking at the TOP_ACTIVATING_TEXTS again:
- cities, governments
- animals, humans
- country (region)
- days, weeks, months, years
- decade
- state
- countries
- monthly
- region, state (cities)
- shoulder
- Days

The dominant themes are clearly geographical/political divisions, time measurements, and some biological/categorical terms.

locations, time units, and categories

np_acts-logits-general · gemini-2.5-flash-lite

Comparing With GEMMA-2-9B-IT @ 20-gemmascope-res-131k

Configuration

google/gemma-scope-9b-it-res/layer_20/width_131k/average_l0_81

Prompts (Dashboard)

24,576 prompts, 128 tokens each

Dataset (Dashboard)

monology/pile-uncopyrighted

Features

131,072

Data Type

float32

Hook Name

blocks.20.hook_resid_post

Hook Layer

20

Architecture

jumprelu

Context Size

1,024

Activation Function

relu
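
The configuration above lists the architecture as jumprelu alongside an activation function of relu. As a rough illustration of how the two differ, the sketch below applies both to the same pre-activations. The threshold value and the input numbers are hypothetical and are not taken from this SAE; in a JumpReLU SAE each feature has its own learned threshold.

```python
def relu(x: float) -> float:
    """Standard ReLU: keep positive values, zero out the rest."""
    return x if x > 0.0 else 0.0


def jumprelu(x: float, threshold: float) -> float:
    """JumpReLU: pass x through unchanged only when it exceeds the threshold."""
    return x if x > threshold else 0.0


# Hypothetical pre-activations and threshold, chosen only for illustration.
pre_activations = [-0.5, 0.2, 0.9, 1.7]
threshold = 0.8

relu_out = [relu(x) for x in pre_activations]
jumprelu_out = [jumprelu(x, threshold) for x in pre_activations]

print(relu_out)      # [0.0, 0.2, 0.9, 1.7]
print(jumprelu_out)  # [0.0, 0.0, 0.9, 1.7]
```

The only difference in this toy case is the input 0.2, which ReLU keeps and JumpReLU suppresses because it falls below the threshold.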

Negative Logits

__(/*!

-0.45

WriteLiteral

-0.36

gut

-0.35

 IBOutlet

-0.35

CppCodeGen

-0.35

Sla

-0.34

 cleave

-0.34

abstractmethod

-0.33

msgTypes

-0.33

󠁴

-0.33

Positive Logits

scrapy

0.52

 Administrativna

0.52

 AttributeSet

0.52

bootstrapcdn

0.52

 Generales

0.50

paksa

0.50

全体

0.49

 ganzes

0.49

RotationOrder

0.48

 myſelf

0.48
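
The page does not say how the logit lists above were computed. One common approach, assumed here, is to take the dot product of the feature's decoder direction with each token's unembedding vector and rank the results. The sketch below shows that ranking on a made-up four-token vocabulary; every vector and token name in it is hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical 3-dimensional decoder direction for a single feature.
decoder_direction = [0.5, -0.2, 0.1]

# Hypothetical unembedding vectors, one per vocabulary token.
unembed = {
    "tok_a": [1.0, 0.0, 0.0],
    "tok_b": [0.0, 1.0, 0.0],
    "tok_c": [-1.0, 0.0, 1.0],
    "tok_d": [0.5, 0.5, 0.5],
}

# Effect of the feature on each token's logit, then sort from highest to lowest.
logit_effects = {tok: dot(decoder_direction, vec) for tok, vec in unembed.items()}
ranked = sorted(logit_effects.items(), key=lambda kv: kv[1], reverse=True)

top_positive = ranked[:2]          # tokens the feature promotes most
top_negative = ranked[-2:][::-1]   # tokens the feature suppresses most

print([tok for tok, _ in top_positive])
print([tok for tok, _ in top_negative])
```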

Activations Density 0.091%
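
To give the density figure a sense of scale, the arithmetic below estimates how many tokens the feature fires on. It assumes the density was measured over the dashboard prompt set listed in the configuration (24,576 prompts of 128 tokens each), which the page does not state explicitly, so the result is only a rough estimate.

```python
# Values taken from the configuration and density fields on this page.
prompts = 24_576
tokens_per_prompt = 128
density = 0.091 / 100  # 0.091% expressed as a fraction

# Assumption: density is the fraction of dashboard tokens with nonzero activation.
total_tokens = prompts * tokens_per_prompt
estimated_active_tokens = round(total_tokens * density)

print(total_tokens)             # 3145728
print(estimated_active_tokens)  # 2863
```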
