Happy new year! Two big collaborations in this one.

Assistant Axis (Anthropic) is pretty interesting. Try to red-team it and watch the capped model resist drifting toward harmful or sycophantic behavior. It's not unbreakable, of course, but I'm curious how easily it can be broken. Warning: the "Isolation" example references self-harm, so please skip it if you're sensitive to that topic.

Gemma Scope 2 (Google DeepMind) is the sequel to the original Gemma Scope, now for Gemma 3. We have many of the dashboards and auto-interp labels up, but unfortunately we're still finishing this up for some models. We expect this to be done by Feb 14th, and we'll announce this again in the next blog post.

We also feature initiatives by folks who do the real work behind the scenes to make Neuronpedia possible. David Chanin is our SAE whisperer, Michael Hanna is future-proofing circuit tracing by extending it to any Transformers model, and Jaden Fiotto-Kaufman and Clément Dumas' work on NNsight/NNterp powers many of our new inference backends. Our thanks to them all for dedicating their time and resources.

Finally, it'll soon be Neuronpedia's 1000th day of existence. We'll be doing something cool for it. Stay tuned!

In This Edition

🤵🏻 Assistant Axis (Anthropic) ➡️ Launch - Monitoring and capping to the assistant persona.
🔬 Gemma Scope 2 (Google DeepMind) ➡️ Launch - Gemma Scope for Gemma 3. WIP, ETA Feb 14th.
🔍 SAELens (David Chanin) ➡️ GitHub - Toy SAEs, Matching Pursuit, Matryoshka, JumpReLU SAEs, v6.
⚡️ Circuit Tracer (Michael Hanna) ➡️ GitHub - Support for any Transformer model, tests++.
👁️ NNsight (Jaden Fiotto-Kaufman) ➡️ GitHub - Neuronpedia now uses nnsight for many models.

🔊 Summary by NotebookLM - The Babble on [Apple] [Spotify]

🤵🏻 Assistant Axis

It's 10PM. Do you know if your model's drifting? [Example]

In collaboration with Anthropic, our interactive demo lets you visualize the difference between the default and "activation capped" Llama 70B. The capped model produces fewer harmful (trigger warning: self-harm), sycophantic, and jailbroken responses. As with any other Neuronpedia steering model, you can also use your own custom chats to red-team it and share the results.

➡️ Launch Demo - Compare example conversations, or use your own custom prompts.
Anthropic Blog, Paper (Lu et al), and GitHub
Vector Dashboard
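
For readers wondering what "activation capped" could mean mechanically, here is a minimal sketch. It is not Anthropic's implementation. It assumes that capping means clamping the component of a hidden state along a single direction (the axis) to a fixed range while leaving the rest of the vector untouched. The function names, the `floor` and `ceiling` parameters, and the toy numbers are all hypothetical; see the paper for the actual method.

```python
import math
from typing import List


def projection(hidden: List[float], axis: List[float]) -> float:
    """Scalar component of `hidden` along the (normalized) `axis` direction."""
    norm = math.sqrt(sum(a * a for a in axis))
    return sum(h * a for h, a in zip(hidden, axis)) / norm


def cap_along_axis(
    hidden: List[float], axis: List[float], floor: float, ceiling: float
) -> List[float]:
    """Clamp the component of `hidden` along `axis` into [floor, ceiling].

    Assumption: only the component along the axis is changed; everything
    orthogonal to the axis is left as-is.
    """
    norm = math.sqrt(sum(a * a for a in axis))
    unit = [a / norm for a in axis]
    current = sum(h * u for h, u in zip(hidden, unit))
    clamped = min(max(current, floor), ceiling)
    shift = clamped - current  # zero when already inside the range
    return [h + shift * u for h, u in zip(hidden, unit)]


# Toy example: a 2-d "hidden state" whose component along the axis is 3.0.
axis = [2.0, 0.0]
hidden = [3.0, 4.0]
capped = cap_along_axis(hidden, axis, floor=0.0, ceiling=1.0)
print(projection(capped, axis))  # 1.0
```

In this toy setup, "monitoring" would amount to logging `projection(...)` at each turn of a conversation, and "capping" to applying `cap_along_axis(...)` whenever that value leaves the allowed range.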

🔬 Gemma Scope 2 (ETA Feb 14)

Gemma Scope 2 feature in gemma-3-27b-it about AI. [Example]

In collaboration with Google DeepMind, our support for this sequel to Gemma Scope 1 covers all sizes of Gemma 3. While the weights have been finalized, we're still uploading the remaining dashboards and auto-interp labels. We'll include this in the next newsletter as well, after everything's complete (ETA Feb 14th).

➡️ Gemma Scope 2 Demo - Notable safety-relevant features and circuit tracing - we recommend starting with the original Gemma Scope 1 Demo if you haven't seen it yet.
Browse & Search - Browse and search dashboards for 64+ million SAE and transcoder latents across 10 Gemma 3 models, including both pretrained and instruct.
Google DeepMind Blog, HuggingFace weights, and Tutorial Notebook

🔍 SAELens

SAELens is Decode's library for training and using Sparse Autoencoders, actively developed by David Chanin. Since SAELens and Neuronpedia are part of the same organization, and Neuronpedia uses SAELens heavily, we'll now also cover SAELens updates in this blog, starting with these:

Training Toy SAEs on Synthetic Data - Docs and Colab
Matching Pursuit SAEs - Docs, Post, and Paper
Matryoshka SAEs
Anthropic's JumpReLU SAEs - Docs and Post
Major Version v6 - Summary + Migration Doc

⚡️ Circuit Tracer

circuit-tracer is Anthropic's open-source library for generating and pruning attribution graphs. Thanks to the excellent work of Michael Hanna, Jaden Fiotto-Kaufman, and David Chanin on the v0.3.1 release, Neuronpedia can now do Gemma 3 circuit tracing. Details on the new version:

Use any Transformers model - circuit-tracer is no longer limited to select models. With the new nnsight engine support, you can now generate graphs for any model, as long as you create a simple mapping (instructions here).
Significantly improved test suite - Dozens of new tests ensure that attributions, interventions, and the new engine work as expected.

👁️ NNsight

nnsight is a library for interpreting and intervening on models, created by NDIF. Neuronpedia is now using nnsight for several models, including gpt-oss, Gemma 3, and Llama 3.3 70B.

We're grateful to Jaden Fiotto-Kaufman and David Bau for their support and tireless work on nnsight, and also to Clément Dumas for his terrific work on nnterp, which makes it easy to access the hook points of different models. We look forward to working more deeply with them in the future.

As always, please contact us with your questions, feedback, and suggestions.

The Residual Stream

Neuronpedia Blog

Assistant Axis, Gemma Scope 2, SAELens, New Circuit Tracing, and NNsight
