
Llama 3.3 70B, Temporal Feature Analysis, Featured Research, gpt-oss-20b, and *a lot more*
In This Edition
- 🦙 Llama 3.3 70B Instruct - Inference, dashboards, and autointerp now available with the Goodfire SAE.
- ⏰ Temporal Feature Analysis - A new, interactive visualization on Neuronpedia.
- 🤖 gpt-oss-20b - Dashboards + autointerp + inference. Uses Andy Arditi's SAEs + MAEs.
- 😈 Misaligned Persona in Qwen+Llama - Arditi and Chen's reproduction of Misaligned Persona.
- 🔬 Research using Neuronpedia: Automated Circuit Interpretation via Probe Prompting
- 📊 Attribution Graph Updates
- 🏷️ Gemma Scope Auto-Interp Refresh + 27B
- 📝 Other Updates
🎙 Summary by NotebookLM - The Babble on [Apple] [Spotify]
🦙 Llama 3.3 70B Instruct
By popular demand, we're supporting our biggest model yet, just in time for your ICLR rebuttals! It launches with initial support for Goodfire's SAE.
- How to Use (see the API sketch at the end of this section)
- Use the Neuronpedia package: `pip install neuronpedia` ➡️ Example Cookbook (Recommended!)
- Or use the API with these parameters: `modelId = "llama3.3-70b-it"`, `source/layer = "50-resid-post-gf"`, `sourceSet = "resid-post-gf"`
- Status
- ✅ Supported Functionality (See Cookbook)
- Dashboards + Autointerp Labels ➡️ Release Page
- Activations for a single feature with custom prompt (API)
- Top activating features from all features with custom prompt (API)
- Top activating features by token from all features with custom prompt (API)
- NEW: Batch Inference - send up to 4 prompts (256 tokens max per prompt) in one request (use array of prompts in API)
- Steering is now enabled as of Nov 22nd
- All other APIs are now available - Search via Inference, TopK by Token, etc.
- 📫 Feedback/Requests - What else do you want this API to do?
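For reference, here's a minimal sketch of a raw API call for single-feature activations using the parameters above. The endpoint path, payload field names, and feature index are assumptions for illustration; the cookbook and the `neuronpedia` package are the authoritative references.

```python
# Sketch: fetch per-token activations for one Llama 3.3 70B feature on a custom prompt.
# NOTE: the endpoint path and payload field names are assumptions -- consult the cookbook
# and the API docs for the exact schema. The feature index is a made-up example.
import os
import requests

API_KEY = os.environ["NEURONPEDIA_API_KEY"]  # from your Neuronpedia account settings

payload = {
    "feature": {
        "modelId": "llama3.3-70b-it",   # model ID from this release
        "source": "50-resid-post-gf",   # Goodfire SAE on the layer-50 residual stream
        "index": "0",                   # hypothetical feature index
    },
    "customText": "The quick brown fox jumps over the lazy dog.",
}

resp = requests.post(
    "https://www.neuronpedia.org/api/activation/new",
    json=payload,
    headers={"x-api-key": API_KEY},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())  # per-token activation values for the requested feature
```

For batch inference, the same pattern applies with an array of prompts (up to 4 prompts, 256 tokens each), per the note above.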
⏰ Temporal Feature Analysis (Lubana, Rager, Hindupur, et al.)
➡️ Open Demo Interface | Release | Tweet
- Background - Temporal Feature Analysis (paper, notebook) is an interpretability method designed to capture both contextual and local information in language model representations.
- ✅ Supported Functionality
- Similarity Matrix Visualization (see the sketch below)
- Supported Models
- Steering Enabled
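As a rough illustration of the kind of token-by-token similarity matrix the demo visualizes (not the paper's exact method), here's a sketch that compares hidden states across positions. The model (gpt2) and layer index are arbitrary stand-ins so it runs anywhere.

```python
# Illustration only: build a token-by-token cosine similarity matrix over hidden states.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # stand-in model; any LM with accessible hidden states works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

text = "The cat sat on the mat because the mat was warm."
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    # hidden_states: tuple of [1, seq, d_model] tensors, one per layer (plus embeddings)
    hidden = model(**inputs, output_hidden_states=True).hidden_states[6][0]

# Cosine similarity between every pair of token representations -> [seq, seq] matrix
normed = hidden / hidden.norm(dim=-1, keepdim=True)
sim = normed @ normed.T

tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0])
print(tokens)
print(sim.round(decimals=2))
```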
🤖 gpt-oss-20b
To kick off support for gpt-oss-20b, we have BatchTopK SAEs by Andy Arditi (a toy sketch of the BatchTopK forward pass is included at the end of this section).
- Release ➡️ neuronpedia.org/gpt-oss-sae
- Dashboards (chat + non-chat MAEs) + autointerp labels
- These are the `trainer_0` SAEs in the Huggingface repo
- Status
- ✅ Most API calls supported, including inference
- ❌ Steering (ETA: Nov 14th - 15th)
- NEW: Dashboards now have a toggle: "Show Raw Tokens" (displays special tokens) vs "Show Formatted" (formats special tokens into `role` and content).
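For intuition, here's a toy sketch of a BatchTopK SAE forward pass with random weights. It does not load the released `trainer_0` weights, and the dimensions and k below are assumptions for illustration only.

```python
# Toy BatchTopK SAE sketch with random weights (illustration only -- not the released SAEs).
import torch

d_model, d_sae, k = 2880, 32768, 64  # assumed sizes, not taken from the release

W_enc = torch.randn(d_model, d_sae) * 0.01
b_enc = torch.zeros(d_sae)
W_dec = torch.randn(d_sae, d_model) * 0.01
b_dec = torch.zeros(d_model)

def batch_topk_encode(x: torch.Tensor) -> torch.Tensor:
    """x: [batch, d_model] residual-stream activations -> sparse feature activations."""
    pre = torch.relu((x - b_dec) @ W_enc + b_enc)
    # BatchTopK: keep the top (k * batch_size) activations across the whole batch,
    # rather than exactly k per example.
    n_keep = k * x.shape[0]
    threshold = pre.flatten().topk(n_keep).values.min()
    return pre * (pre >= threshold)

def decode(feats: torch.Tensor) -> torch.Tensor:
    return feats @ W_dec + b_dec

x = torch.randn(4, d_model)
feats = batch_topk_encode(x)
print("active features per example:", (feats != 0).sum(dim=-1).tolist())
print("reconstruction shape:", decode(feats).shape)
```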
😈 Misaligned Persona in Open Weight Models
Arditi and Chen's reproduction of Misaligned Persona, for Qwen and Llama. [LessWrong Post]
- Release ➡️ neuronpedia.org/misaligned-persona
- Dashboards (chat + non-chat MAEs) + autointerp labels
- Supported Models
- Llama 3.1 8B Instruct (Inference + Steering Enabled) - Layers 3, 7, 11, 15, 19, 23, 27
- Qwen 2.5 7B Instruct - Layers 3, 7, 11, 15, 19, 23, 27
- Steering Examples - Llama 3.1 8B
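For readers curious what steering does under the hood, here's a hedged sketch of the general mechanism: add a scaled decoder direction to the residual stream at one layer during generation. It uses gpt2 and a random direction as stand-ins so it runs without gated weights; on Neuronpedia the model is Llama 3.1 8B Instruct and the direction comes from a trained feature's decoder row.

```python
# Sketch of activation steering via a forward hook (stand-in model + random direction).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

layer_idx, strength = 7, 8.0                      # assumed layer and steering strength
direction = torch.randn(model.config.hidden_size) # stand-in for an SAE decoder direction
direction = direction / direction.norm()

def steer_hook(module, inputs, output):
    # The block returns a tuple; the first element is the residual stream [batch, seq, d].
    hidden = output[0] + strength * direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(steer_hook)
try:
    ids = tok("The weather today is", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=20, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook so later generations are unsteered
```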
🔬 Research using Neuronpedia: Automated Circuit Interpretation via Probe Prompting
Giuseppe Birardi
Preprint | LW Post | Interactive Demo | Github Repo
Attribution Graph Probing automates feature interpretation with probe prompting, turning CLT attribution graphs into compact subgraphs of concept-aligned supernodes, validated via Neuronpedia Replacement/Completeness scores (≈0.54/0.83). On US-capitals circuits it beats geometric clustering on interpretability and reveals an early-vs-late layer split (early layers generalize; late-layer Say-X features specialize).
The demo is mostly push-button, and the repo makes it easy to scale to many graphs.
Highlighted automated subgraphs:
📊 Attribution Graph Updates
- NEW: Zoom and pan the subgraph using the buttons on the bottom left, or by pinching / clicking and dragging
- NEW: Graph and subgraph replacement and completeness scores, at the bottom of the subgraph
- NEW: Qwen3-4B - Steering from graph now enabled
- NEW: "Remix" Graph
- Click "Remix" to edit the current prompt for a graph and generate a new graph from it.
- For Gemma 2 2B, you can also switch between using the Gemma Scope Transcoders and the Anthropic Fellows' Cross Layer Transcoders to quickly compare the difference.
- NEW: Download the labels along with the graph - click Info, then "Download Graph JSON + Labels" (a sketch for loading the file follows below)
- NEW: Simplified UI for graph generation (click "Advanced" for other adjustments)
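If you download a graph, here's a small sketch for inspecting it offline. The key names ("nodes", "links", "clerp", "node_id") follow the open attribution-graph format, but treat them as assumptions and check your own file if they differ.

```python
# Sketch: inspect a downloaded "Graph JSON + Labels" file (field names are assumptions).
import json
from collections import Counter

with open("graph.json") as f:  # path to the file downloaded from the Info panel
    graph = json.load(f)

nodes = graph.get("nodes", [])
links = graph.get("links", [])
print(f"{len(nodes)} nodes, {len(links)} links")

# Count nodes per layer, then print a few labeled features.
print(Counter(str(n.get("layer")) for n in nodes))
for n in nodes[:10]:
    label = n.get("clerp") or n.get("label") or "(no label)"
    print(n.get("node_id"), "-", label)
```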
🏷️ Gemma Scope Auto-Interp Label Refresh + 27B Dashboards
- NEW: Added dashboards for 3 layers of Gemma 2 27B
- REFRESH: We've refreshed many Gemma Scope autointerp labels, using a new General Explainer designed/tested with thinking models (like Gemini 2.5 Flash), for multiple models, layers, and widths:
| Model | Hook and Width | Layers | Example |
| --- | --- | --- | --- |
| gemma-2-27b (New!) | gemmascope-res-131k (New!) | 10, 22, 34 | Link |
| gemma-2-2b | gemmascope-res-16k | 16, 18, 20, 22, 24 | Link |
| gemma-2-2b | gemmascope-res-65k | 16, 18, 20, 22, 24 | Link |
| gemma-2-9b | gemmascope-res-16k | 24, 26, 28, 30, 32 | Link |
| gemma-2-9b | gemmascope-res-131k | 24, 26, 28, 30, 32 | Link |
| gemma-2-9b-it | gemmascope-res-16k | 9, 20, 31 | Link |
| gemma-2-9b-it | gemmascope-res-131k | 9, 20, 31 | Link |
As with all data on Neuronpedia, exports (including dashboards, labels, etc.) are publicly available.
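For example, here's a sketch of pulling the refreshed labels for one Gemma Scope SAE programmatically. The export endpoint URL and parameter names are assumptions based on existing Neuronpedia tooling; check the API docs if they've changed.

```python
# Sketch: fetch autointerp labels for one Gemma Scope SAE via the public export endpoint.
# NOTE: URL, query parameters, and response schema are assumptions for illustration.
import requests

url = "https://www.neuronpedia.org/api/explanation/export"
params = {"modelId": "gemma-2-2b", "saeId": "20-gemmascope-res-16k"}

resp = requests.get(url, params=params, timeout=120)
resp.raise_for_status()
explanations = resp.json()  # assumed: list of {index, description, ...} records
print(len(explanations), "labels")
print(explanations[0])
```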
📝 Other Updates
- There's now an Available Resources page to see what models and sources are available for inference.
- You may have noticed this post is "bullet point"-heavy, with fewer graphics and paragraphs. We hope this lets us write update posts more efficiently, so we can post updates more frequently instead of bundling everything into one big post.
- It was Neel Nanda's birthday on Monday - our thanks for his awesome work and support.
- I guess we did a lot of auto-interp:

As always, please contact us with your questions, feedback, and suggestions.
