
Llama 3.3 70B, Temporal Feature Analysis, Featured Research, gpt-oss-20b, and *a lot more*
In This Edition
- 🦙 Llama 3.3 70B Instruct - Inference, dashboards, and autointerp now available with the Goodfire SAE.
- ⏰ Temporal Feature Analysis - A new, interactive visualization on Neuronpedia.
- 🤖 gpt-oss-20b - Dashboards + autointerp + inference. Uses Andy Arditi's SAEs + MAEs.
- 😈 Misaligned Persona in Qwen+Llama - Arditi and Chen's reproduction of Misaligned Persona.
- 🔬 Research using Neuronpedia: Automated Circuit Interpretation via Probe Prompting
- 📊 Attribution Graph Updates
- 🏷️ Gemma Scope Auto-Interp Refresh + 27B
- 📝 Other Updates
🎙 Summary by NotebookLM - The Babble on [Apple] [Spotify]
🦙 Llama 3.3 70B Instruct
By popular demand, we're supporting our biggest model yet, just in time for your ICLR rebuttals! It launches with initial support for Goodfire's SAE.
- How to Use (see the API sketch at the end of this section)
- Use the Neuronpedia package: `pip install neuronpedia` ➡️ Example Cookbook (Recommended!)
- Or use the API with these parameters: `modelId = "llama3.3-70b-it"`, `source/layer = "50-resid-post-gf"`, `sourceSet = "resid-post-gf"`
- Status
- ✅ Supported Functionality (See Cookbook)
- Dashboards + Autointerp Labels ➡️ Release Page
- Activations for a single feature with custom prompt (API)
- Top activating features from all features with custom prompt (API)
- Top activating features by token from all features with custom prompt (API)
- NEW: Batch Inference - send up to 4 prompts (256 tokens max per prompt) in one request (use array of prompts in API)
- Steering is now enabled as of Nov 22nd
- All other APIs are now available - Search via Inference, TopK by Token, etc.
- 📫 Feedback/Requests - What else do you want this API to do?
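For reference, here's a minimal sketch of a raw API call for single-feature activations using the parameters above. The endpoint path, payload field names, and feature index are assumptions for illustration; the cookbook and the `neuronpedia` package are the authoritative references.

```python
# Sketch: fetch per-token activations for one Llama 3.3 70B feature on a custom prompt.
# NOTE: the endpoint path and payload field names are assumptions -- consult the cookbook
# and the API docs for the exact schema. The feature index is a made-up example.
import os
import requests

API_KEY = os.environ["NEURONPEDIA_API_KEY"]  # from your Neuronpedia account settings

payload = {
    "feature": {
        "modelId": "llama3.3-70b-it",   # model ID from this release
        "source": "50-resid-post-gf",   # Goodfire SAE on the layer-50 residual stream
        "index": "0",                   # hypothetical feature index
    },
    "customText": "The quick brown fox jumps over the lazy dog.",
}

resp = requests.post(
    "https://www.neuronpedia.org/api/activation/new",
    json=payload,
    headers={"x-api-key": API_KEY},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())  # per-token activation values for the requested feature
```

For batch inference, the same pattern applies with an array of prompts (up to 4 prompts, 256 tokens each), per the note above.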
⏰ Temporal Feature Analysis (Lubana, Rager, Hindupur, et al.)
➡️ Open Demo Interface | Release | Tweet
- Background - Temporal Feature Analysis (paper, notebook) is an interpretability method designed to capture both contextual and local information in language model representations.
- ✅ Supported Functionality
- Similarity Matrix Visualization (see the sketch below)
- Supported Models
- Steering Enabled
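As a rough illustration of the kind of token-by-token similarity matrix the demo visualizes (not the paper's exact method), here's a sketch that compares hidden states across positions. The model (gpt2) and layer index are arbitrary stand-ins so it runs anywhere.

```python
# Illustration only: build a token-by-token cosine similarity matrix over hidden states.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # stand-in model; any LM with accessible hidden states works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

text = "The cat sat on the mat because the mat was warm."
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    # hidden_states: tuple of [1, seq, d_model] tensors, one per layer (plus embeddings)
    hidden = model(**inputs, output_hidden_states=True).hidden_states[6][0]

# Cosine similarity between every pair of token representations -> [seq, seq] matrix
normed = hidden / hidden.norm(dim=-1, keepdim=True)
sim = normed @ normed.T

tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0])
print(tokens)
print(sim.round(decimals=2))
```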
🤖 gpt-oss-20b
To kick off support for gpt-oss-20b, we have BatchTopK SAEs by Andy Arditi (a toy sketch of the BatchTopK forward pass is included at the end of this section).
- Release ➡️ neuronpedia.org/gpt-oss-sae
- Dashboards (chat + non-chat MAEs) + autointerp labels
- These are the `trainer_0` SAEs in the Huggingface repo
- Status
- ✅ Most API calls supported, including inference
- ❌ Steering (ETA: Nov 14th - 15th)
- NEW: Dashboards now have a toggle: "Show Raw Tokens" (displays special tokens) vs "Show Formatted" (formats special tokens into `role` and content).
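For intuition, here's a toy sketch of a BatchTopK SAE forward pass with random weights. It does not load the released `trainer_0` weights, and the dimensions and k below are assumptions for illustration only.

```python
# Toy BatchTopK SAE sketch with random weights (illustration only -- not the released SAEs).
import torch

d_model, d_sae, k = 2880, 32768, 64  # assumed sizes, not taken from the release

W_enc = torch.randn(d_model, d_sae) * 0.01
b_enc = torch.zeros(d_sae)
W_dec = torch.randn(d_sae, d_model) * 0.01
b_dec = torch.zeros(d_model)

def batch_topk_encode(x: torch.Tensor) -> torch.Tensor:
    """x: [batch, d_model] residual-stream activations -> sparse feature activations."""
    pre = torch.relu((x - b_dec) @ W_enc + b_enc)
    # BatchTopK: keep the top (k * batch_size) activations across the whole batch,
    # rather than exactly k per example.
    n_keep = k * x.shape[0]
    threshold = pre.flatten().topk(n_keep).values.min()
    return pre * (pre >= threshold)

def decode(feats: torch.Tensor) -> torch.Tensor:
    return feats @ W_dec + b_dec

x = torch.randn(4, d_model)
feats = batch_topk_encode(x)
print("active features per example:", (feats != 0).sum(dim=-1).tolist())
print("reconstruction shape:", decode(feats).shape)
```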
😈 Misaligned Persona in Open Weight Models
Arditi and Chen's reproduction of Misaligned Persona, for Qwen and Llama. [LessWrong Post]
- Release ➡️ neuronpedia.org/misaligned-persona
- Dashboards (chat + non-chat MAEs) + autointerp labels
- Supported Models
- Llama 3.1 8B Instruct (Inference + Steering Enabled) - Layers 3, 7, 11, 15, 19, 23, 27
- Qwen 2.5 7B Instruct - Layers 3, 7, 11, 15, 19, 23, 27
- Steering Examples - Llama 3.1 8B
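For readers curious what steering does under the hood, here's a hedged sketch of the general mechanism: add a scaled decoder direction to the residual stream at one layer during generation. It uses gpt2 and a random direction as stand-ins so it runs without gated weights; on Neuronpedia the model is Llama 3.1 8B Instruct and the direction comes from a trained feature's decoder row.

```python
# Sketch of activation steering via a forward hook (stand-in model + random direction).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

layer_idx, strength = 7, 8.0                      # assumed layer and steering strength
direction = torch.randn(model.config.hidden_size) # stand-in for an SAE decoder direction
direction = direction / direction.norm()

def steer_hook(module, inputs, output):
    # The block returns a tuple; the first element is the residual stream [batch, seq, d].
    hidden = output[0] + strength * direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(steer_hook)
try:
    ids = tok("The weather today is", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=20, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook so later generations are unsteered
```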
🔬 Research using Neuronpedia: Automated Circuit Interpretation via Probe Prompting
Giuseppe Birardi
Preprint | LW Post | Interactive Demo | Github Repo
Attribution Graph Probing automates feature interpretation with probe prompting, turning CLT attribution graphs into compact subgraphs of concept-aligned supernodes, validated via Neuronpedia Replacement/Completeness scores (≈0.54/0.83). On US-capitals circuits it beats geometric clustering on interpretability and reveals an early-vs-late layer split (early layers generalize; late-layer Say-X features specialize).
The demo is mostly push-button, and the repo makes it easy to scale to many graphs.
Highlighted automated subgraphs:
📊 Attribution Graph Updates
- NEW: Zoom and pan the subgraph using the buttons on the bottom left, or by pinching / clicking and dragging
- NEW: Graph and subgraph replacement and completeness scores, at the bottom of the subgraph
- NEW: Qwen3-4B - Steering from graph now enabled
- NEW: "Remix" Graph
- Click "Remix" to edit the current prompt for a graph and generate a new graph from it.
- For Gemma 2 2B, you can also switch between using the Gemma Scope Transcoders and the Anthropic Fellows' Cross Layer Transcoders to quickly compare the difference.
- NEW: Download the labels along with the graph - click Info, then "Download Graph JSON + Labels" (a sketch for loading the file follows below)
- NEW: Simplified UI for graph generation (click "Advanced" for other adjustments)
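If you download a graph, here's a small sketch for inspecting it offline. The key names ("nodes", "links", "clerp", "node_id") follow the open attribution-graph format, but treat them as assumptions and check your own file if they differ.

```python
# Sketch: inspect a downloaded "Graph JSON + Labels" file (field names are assumptions).
import json
from collections import Counter

with open("graph.json") as f:  # path to the file downloaded from the Info panel
    graph = json.load(f)

nodes = graph.get("nodes", [])
links = graph.get("links", [])
print(f"{len(nodes)} nodes, {len(links)} links")

# Count nodes per layer, then print a few labeled features.
print(Counter(str(n.get("layer")) for n in nodes))
for n in nodes[:10]:
    label = n.get("clerp") or n.get("label") or "(no label)"
    print(n.get("node_id"), "-", label)
```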
🏷️ Gemma Scope Auto-Interp Label Refresh + 27B Dashboards
- NEW: Added dashboards for 3 layers of Gemma 2 27B
- REFRESH: We've refreshed many Gemma Scope autointerp labels, using a new General Explainer designed/tested with thinking models (like Gemini 2.5 Flash), for multiple models, layers, and widths:
| Model | Hook and Width | Layers | Example |
| --- | --- | --- | --- |
| gemma-2-27b (New!) | gemmascope-res-131k (New!) | 10, 22, 34 | Link |
| gemma-2-2b | gemmascope-res-16k | 16, 18, 20, 22, 24 | Link |
| gemma-2-2b | gemmascope-res-65k | 16, 18, 20, 22, 24 | Link |
| gemma-2-9b | gemmascope-res-16k | 24, 26, 28, 30, 32 | Link |
| gemma-2-9b | gemmascope-res-131k | 24, 26, 28, 30, 32 | Link |
| gemma-2-9b-it | gemmascope-res-16k | 9, 20, 31 | Link |
| gemma-2-9b-it | gemmascope-res-131k | 9, 20, 31 | Link |
As with all data on Neuronpedia, exports (including dashboards, labels, etc.) are publicly available.
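For example, here's a sketch of pulling the refreshed labels for one Gemma Scope SAE programmatically. The export endpoint URL and parameter names are assumptions based on existing Neuronpedia tooling; check the API docs if they've changed.

```python
# Sketch: fetch autointerp labels for one Gemma Scope SAE via the public export endpoint.
# NOTE: URL, query parameters, and response schema are assumptions for illustration.
import requests

url = "https://www.neuronpedia.org/api/explanation/export"
params = {"modelId": "gemma-2-2b", "saeId": "20-gemmascope-res-16k"}

resp = requests.get(url, params=params, timeout=120)
resp.raise_for_status()
explanations = resp.json()  # assumed: list of {index, description, ...} records
print(len(explanations), "labels")
print(explanations[0])
```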
📝 Other Updates
- There's now an Available Resources page to see what models and sources are available for inference.
- You may have noticed this post is "bullet point"-heavy, with fewer graphics and paragraphs. We hope this lets us write update posts more efficiently, so we can post updates more frequently instead of bundling everything into one big post.
- It was Neel Nanda's birthday on Monday - our thanks for his awesome work and support.
- I guess we did a lot of auto-interp:

As always, please contact us with your questions, feedback, and suggestions.
