Neuronpedia

EXPLANATION TYPE

np_max-act-logits

Description

Attempts to replicate the automated interpretability (autointerp) method that Anthropic used to label the features in their attribution graphs paper.

Author

Neuronpedia

URL

Settings

Each activation is shown as 24 tokens around the max-activating token. The top 10 logits are shown. The max-activating token is also shown to the model. Uses the top 10 deduplicated activations.
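As a rough illustration of how these settings might fit together, here is a minimal Python sketch. It is not Neuronpedia's implementation. The `Activation` class and both function names are hypothetical. It also assumes that "24 tokens around max act" means a 24-token window in total, and that deduplication means dropping activations whose windowed text is identical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Activation:
    """Hypothetical container: one text sample and its per-token activation values."""
    tokens: List[str]
    values: List[float]


def window_around_max(act: Activation, window: int = 24) -> Tuple[List[str], str]:
    """Return up to `window` tokens around the max-activating token, plus that token."""
    peak = max(range(len(act.values)), key=act.values.__getitem__)
    start = max(0, peak - window // 2)
    end = min(len(act.tokens), start + window)
    # Shift the window left if it was cut short at the end of the text.
    start = max(0, end - window)
    return act.tokens[start:end], act.tokens[peak]


def top_deduplicated(acts: List[Activation], k: int = 10,
                     window: int = 24) -> List[Tuple[List[str], str]]:
    """Pick the k strongest activations, skipping ones with identical windowed text."""
    ranked = sorted(acts, key=lambda a: max(a.values), reverse=True)
    seen = set()
    selected = []
    for act in ranked:
        tokens, peak_token = window_around_max(act, window)
        key = tuple(tokens)
        if key in seen:  # assumption: duplicates are exact window matches
            continue
        seen.add(key)
        selected.append((tokens, peak_token))
        if len(selected) == k:
            break
    return selected
```

Each selected window, its max-activating token, and the feature's top 10 logits would then go into the prompt sent to the explainer model. The sketch leaves out that prompt-building step, since its format is not described here.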

Recent Explanations

word/acronym fragments

gemini-2.5-flash

Boy"\n2. "Sims"\n3. "