
Collaborating on Circuits, New Graph Features, and a Fellowship Opportunity
In This Edition
Summary by NotebookLM - The Babble on [Apple] [Spotify]
- The Circuit Analysis Research Landscape: A collaborative post on circuit tracing and cross-layer transcoders (CLTs), including research extensions, perspectives, and future directions. Co-authored by Anthropic, Google DeepMind, Goodfire AI, EleutherAI, and Decode Research. [Read the Post] [Tweets with GIFs] [YouTube Tutorial] [Open Source CLTs] [circuit-tracer v0.2] [clt-training GitHub]
- New Graph Features: Neuronpedia's circuit tracer interface now supports steering/interventions, Qwen3 4B graph generation, horizontally scrollable link graphs for long prompts, and other fixes. [Circuit Tracer] [Qwen3 4B Graph]
- Fellowship with Neel Nanda (MATS): (All experience levels are welcome - many past scholars had no prior research experience.) Learn and publish research under the supervision of Neel Nanda, lead of the mech interp team at Google DeepMind. This full-time fellowship comes with a stipend and provides other financial support, including housing, transportation, etc. [Application due August 29th]
Post: The Circuit Analysis Research Landscape
[Read the Post] [Tweets with GIFs] [YouTube Tutorial] [Open Source CLTs] [circuit-tracer v0.2 GitHub] [clt-training GitHub]

Background
An awesome thing about Anthropic's recent circuits/biology release was that they open sourced the tools together with Anthropic Fellows Michael Hanna and Mateusz Piotrowski, and also spent significant time helping other interpretability orgs (like us!) understand and properly implement their work.
The Perks of Collaboration
Of course, the objective of onboarding others to a promising research direction is that they, too, can contribute to overall progress in the science. Today, we report on that progress - the result of a multi-org collaboration between Anthropic, Google DeepMind, Goodfire AI, EleutherAI, and Decode Research - in The Circuit Analysis Research Landscape.
Open source collaboration enabled each organization to contribute its strengths, to verify and review efficiently, and to make the most of everyone's time and resources by avoiding duplicated effort. For instance, in section three (Transcoder Architecture and Implementation), four teams of researchers divided up the work of developing, training, optimizing, and benchmarking eleven types of transcoder architectures across multiple models - a breadth that would have been burdensome, and surely slower, for a single team to achieve.
We're tremendously optimistic about the potential of future cross-organization collaborations for advancing interpretability, and both grateful and proud to support them on Neuronpedia. Check out the post here.
New Resources
Significant code and artifacts from this post that you can use immediately:
- Attribution Graphs for Dummies: A two-part YouTube series that walks you through what attribution graphs are and how to generate and analyze circuits. Hosted by researchers at Anthropic, Google DeepMind, and Goodfire AI.
- circuit-tracer v0.2.0: Cross-layer transcoder (CLT) support, memory improvements, text generation improvements, and better support for new models.
- Open Source CLTs: Cross-layer transcoders for Llama 3.2 1B and Gemma 2 2B (see the toy sketch below for what "cross-layer" means).
- clt-training: A CLT training library by EleutherAI that's optimized for low-GPU environments.
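If you're wondering what makes these transcoders "cross-layer", here's a toy PyTorch sketch of the idea (purely illustrative - this is not the circuit-tracer or clt-training API, and every name, shape, and hyperparameter below is made up): sparse features are encoded from activations at one layer, and each feature writes a decoded contribution to that layer and to every later layer, rather than only to the layer it was read from.

```python
import torch
import torch.nn as nn

class ToyCrossLayerTranscoder(nn.Module):
    """Toy cross-layer transcoder (CLT): features are read at `read_layer`
    and write reconstructions to every layer from `read_layer` onward.
    A simplified illustration only - not the clt-training implementation."""

    def __init__(self, d_model: int, n_features: int, read_layer: int, n_layers: int):
        super().__init__()
        self.read_layer = read_layer
        self.encoder = nn.Linear(d_model, n_features)  # activations -> feature pre-acts
        # One decoder per layer this transcoder writes to (read_layer .. n_layers - 1).
        self.decoders = nn.ModuleList(
            [nn.Linear(n_features, d_model, bias=False) for _ in range(n_layers - read_layer)]
        )

    def forward(self, acts_at_read_layer: torch.Tensor):
        feats = torch.relu(self.encoder(acts_at_read_layer))  # sparse feature activations
        writes = [dec(feats) for dec in self.decoders]        # one write per downstream layer
        return feats, writes

# Tiny usage example with random data: 2 prompts, 8 tokens each, d_model = 64.
clt = ToyCrossLayerTranscoder(d_model=64, n_features=256, read_layer=3, n_layers=12)
acts = torch.randn(2, 8, 64)
feats, writes = clt(acts)
print(feats.shape, len(writes), writes[0].shape)
# torch.Size([2, 8, 256]) 9 torch.Size([2, 8, 64])
```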
Graph Updates: Steering, Qwen3 4B, UI for Long Prompts
[Circuit Tracer] [Qwen3-4B Graph] [YouTube Steering Demo]
Steering nodes and supernodes (groups of features)
Once you've generated an attribution graph from a prompt, built a subgraph based on your hypotheses about the model's internal reasoning process, and grouped features into supernodes, it's time to test whether your hypothesized circuits are right. You can now do that by steering both nodes and supernodes directly from the Neuronpedia graph interface!

Here, we negatively steer on two steps of the 'Dallas state capital' reasoning process to verify our circuit hypotheses.
Using the example of "The capital of the state containing Dallas is" (Austin), our hypothesis is roughly:
Dallas ➡️ Texas ➡️ capital ➡️ Austin
Above, we negatively steer two supernodes separately.
- Interrupting the "capital" step: Dallas ➡️ Texas ➡️ ❌ ➡️ Texas
- Interrupting the "Texas" step: Dallas ➡️ ❌ ➡️ capital ➡️ Albany (New York's capital)
In both cases, our steering outputs supported our circuit hypothesis. While this doesn't prove that the circuit hypotheses are correct, it's still pretty strong evidence that we're on the right track.
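If you're curious what a steering intervention looks like under the hood, it roughly amounts to adding (positive steering) or subtracting (negative steering) a feature's decoder direction in the residual stream at a chosen layer and token position during the forward pass. Below is a minimal sketch of that mechanism using TransformerLens and GPT-2, with a random unit vector standing in for a real CLT feature direction - the model, layer, position, strength, and direction are all placeholders for illustration, not what Neuronpedia actually runs.

```python
import torch
from transformer_lens import HookedTransformer

# A small model, used only to demonstrate the mechanics of the intervention.
model = HookedTransformer.from_pretrained("gpt2")

prompt = "The capital of the state containing Dallas is"
tokens = model.to_tokens(prompt)

# Stand-in for a feature's decoder direction. On Neuronpedia this comes from the
# feature or supernode you select; here it's a random unit vector for illustration.
feature_dir = torch.randn(model.cfg.d_model)
feature_dir = feature_dir / feature_dir.norm()

steer_layer = 6    # layer whose residual stream we modify (placeholder)
steer_pos = -1     # token position to steer at (last token here, placeholder)
strength = -8.0    # negative strength suppresses the direction; positive boosts it

def steer_hook(resid, hook):
    # resid has shape [batch, seq, d_model]; nudge it along the feature direction
    # at the chosen position only.
    resid[:, steer_pos, :] += strength * feature_dir.to(resid.device)
    return resid

logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(f"blocks.{steer_layer}.hook_resid_post", steer_hook)],
)
next_token = logits[0, -1].argmax().item()
print(model.tokenizer.decode([next_token]))  # steered next-token prediction
```

Steering on Neuronpedia does the equivalent with the actual feature (or supernode) directions you've selected, at the layers and positions shown in the graph.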
Steering with Features not in the Graph
Sometimes, you may want to steer with a feature that's not already in the graph. Here, we steer the same Dallas graph with a California feature, which causes the model to output Sacramento as the capital.

Here, we steer using a California feature to change the model's output from Austin to Sacramento.
The steps demonstrated in the above video:
- Click "+ Add Feature"
- Search for the feature you want (Pro tip: select the later layers - you can choose multiple layers.)
- Pick the feature to steer and the position it might be most relevant for.
- Update the steering strength, and click steer.
Generate Graphs with Qwen3 4B
We're expanding beyond Gemma 2 2B, thanks to the availability of Qwen3 4B transcoders from the Anthropic Fellows. Importantly, Qwen3 4B is a "reasoning"/"thinking" model, so it should be interesting to see how the internal reasoning of the model compares to its "external" reasoning (visible chain of thought tokens).
To use it, click "+ New Graph", then open the model dropdown and select "Qwen3-4B". Here's an example Qwen3 4B graph.
Important Usage Notes & Limitations
- Limited to 64 Tokens: We'll expand this once we add some memory optimizations (especially from the new circuit-tracer v0.2) and benchmark them, but for now prompts have a maximum of 64 tokens.
- Steering Not Enabled: Due to some edge cases with Qwen3's special tokens, steering isn't available yet - we'll enable it on Qwen3-4B in the near future.
Horizontal Scrolling for Longer Prompts
With the addition of thinking models and, inevitably, much longer prompts, we'll need multiple solutions to address the clutter of hundreds, if not thousands, of tokens and nodes. This will be a multi-step project, but we're starting simply by adding an "Expand" toggle for graphs longer than 16 tokens.
Enabling the "Expand" toggle makes the link/attribution graph horizontally scrollable, and spaces out the prompt tokens so that there's more breathing room to render and accurately select the nodes you're interested in. To zoom back out, just toggle "Expand" again to disable it.
Here's an example with a 60-token prompt, demoed below:

Here, we toggle 'Expand' on and off to give more space to the cramped attribution graph.
Of course, you can imagine many other ways to compact specific segments of the prompt - and we can too! We just need to find some time to implement them - stay tuned!
Fellowship with Neel Nanda (MATS)
If you've made it this far down the page, you're probably fairly interested in mechanistic interpretability. You should indulge that interest further by doing a MATS fellowship with Neel Nanda, who leads mech interp at Google DeepMind - and happily accept the stipend plus additional financial support (including housing, transportation, food, etc.) that the fellowship offers!
- All experience levels accepted - Many MATS scholars have no prior research experience.
- Applications due August 29th - Get started early - this isn't your typical application.
- Recommend a friend - Not a good time for you? Recommend someone who might be interested.
➡️ Details, FAQ, and Application for Neel MATS ⬅️
As always, please contact us with your questions, feedback, and suggestions.