
🔊 Summary by NotebookLM - The Babble on [Apple] [Spotify]
[Try It] [Youtube - Latent Space] [Post]
Anthropic's Circuit Tracing paper demonstrated a way to visualize and trace the internal reasoning process of a model, so you can see how it arrived at its next token. We partnered with Anthropic and Anthropic Fellows to bring these attribution graphs to Neuronpedia.
A long-form video explainer of Circuit Tracer, which includes a demo of our integration, is featured on the Latent Space Youtube episode, The Utility of Interpretability.

[Example] [Library - Notebook] [API] [Github]
Using circuit-tracer, built by Anthropic Fellows Michael Hanna and Mateusz Piotrowski in collaboration with Anthropic, you can now instantly generate attribution graphs on Neuronpedia. Almost 5,000 graphs have been generated by users so far. 🤯
Currently, we support on-demand graph generation for Gemma-2-2B using the Gemmascope transcoders, and we're planning to add support for more models soon.
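To sketch what an on-demand generation request might look like, here is a minimal example of assembling a request body in Python. The field names (`prompt`, `modelId`, `sourceSet`) and their values are illustrative assumptions, not the documented API schema - consult the linked API docs for the real format.

```python
import json

def build_graph_request(prompt: str, model_id: str = "gemma-2-2b") -> str:
    """Assemble a hypothetical JSON request body for on-demand graph generation.

    Field names are assumptions for illustration; the Neuronpedia API docs
    define the actual schema.
    """
    payload = {
        "prompt": prompt,                        # text whose next token we want to trace
        "modelId": model_id,                     # currently only Gemma-2-2B is supported
        "sourceSet": "gemmascope-transcoders",   # assumed transcoder identifier
    }
    return json.dumps(payload)

body = build_graph_request("The capital of France is")
print(body)
```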
[JSON Schema + Validator] [Library - Notebook] [API]
As a researcher, you may want to use a different method of generating graphs, or a different model. Neuronpedia fully supports uploading your own graphs, whether you use the circuit-tracer library, your own library, or different models/transcoders. Once uploaded, you can create subgraphs, share them, see feature details, and more - all the same functionality as the graphs generated on Neuronpedia.
We provide a Python Notebook and API for uploading graphs. Some example uploaded graphs: A GPT2-Small graph from Goodfire researchers, and a Gelu-4L graph from EleutherAI.
Before uploading your graph, you'll want to pay special attention to validating your graph file:
In order for Neuronpedia to properly save and display your attribution graph, we need to ensure it's in a compatible format. We created JSON schemas for both the graph file and the feature details files, based on Anthropic's original data formats.
Once you've generated a JSON graph, paste your graph JSON into the validator, and it will show you errors, missing fields, and additional optional fields you can add.
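Before reaching for the web validator, you can run a quick structural sanity check locally. The sketch below assumes a simplified view of the format - `metadata`, `nodes`, and `links` as required top-level keys, loosely based on Anthropic's graph format - so treat the published JSON schemas as authoritative.

```python
import json

# Assumed required top-level keys, based loosely on Anthropic's graph format.
# The published JSON schema is the authoritative reference.
REQUIRED_TOP_LEVEL = ("metadata", "nodes", "links")

def quick_check(graph_json: str) -> list:
    """Return a list of human-readable problems (empty list = looks OK)."""
    try:
        graph = json.loads(graph_json)
    except json.JSONDecodeError as e:
        return [f"not valid JSON: {e}"]
    problems = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in graph:
            problems.append(f"missing top-level key: {key!r}")
    return problems

print(quick_check('{"metadata": {}, "nodes": []}'))  # -> ["missing top-level key: 'links'"]
```

This only catches gross structural mistakes; the validator on Neuronpedia additionally reports missing fields and optional fields you can add.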
By default, we feature the graphs and subgraphs from the circuit-tracer repository for Gemma-2-2b, and the graphs from the Anthropic paper for Claude Haiku 3.5 - these graphs are shown in the public dropdowns. If you want to have your graph featured, let us know via email or Slack.
[Explanation Page] [Python] [Typescript]
We built and ran a new auto-interp method to solve some problems with our existing auto-interp method. You can use np_max-act-logits now, either through the Neuronpedia site, our API, or directly in our automated-interpretability library (MaxActivationAndLogitsExplainer).

Here, we use the new autointerp method np_max-act-logits to get a much more concise explanation.
Auto-interp is a method of using LLMs to label features/latents with what they appear to be doing. These labels let us quickly understand a latent without having to read its top activations or top logits. It's not perfect - there are many cases where auto-interp fails to accurately label a feature.
Our new auto-interp method has the following advantages over the previous primary auto-interp method used on Neuronpedia (oai_token-act-pair, based on Bills et al):
- np_max-act-logits shows the top positive logits to the model and asks it to find a pattern (with some hinting and examples). For example, oai_token-act-pair explains one feature as a mix of locations, political words, and code. np_max-act-logits simply looks at the top logits, identifies them all as cities, then cleanly responds with one word: "cities".
- np_max-act-logits shows the model a separate list of the tokens that appear immediately after the top activating token, and asks the model to identify a pattern. If a pattern is found, the model is asked to format its response simply as say [the pattern]. For example, np_max-act-logits correctly identifies every token after the top activating token as the number "8", and provides the explanation "say 8". Note that the top positive logits were not useful here - they have nothing to do with the number 8.
- np_max-act-logits asks the model to be concise and provides explicit examples of "padding" phrasing to avoid, along with examples of concise answers, and we postprocess the model's answer after it returns. For example, np_max-act-logits explains one feature simply as "sounds".

The full implementation is open source and available in our automated-interpretability repository as the MaxActivationAndLogitsExplainer. It's also reimplemented in Typescript in the webapp. We ran this new auto-interp on layers 16 to 25 of the Gemma-2-2B Gemmascope transcoders, and these are the default explanations shown when hovering over features in attribution graphs.
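The postprocessing step for conciseness can be illustrated with a small sketch: strip common "padding" phrasing from the model's raw explanation. The phrase list below is hypothetical, written to match the description here - the real logic lives in MaxActivationAndLogitsExplainer in the automated-interpretability repository.

```python
import re

# Hypothetical padding phrases to strip, in the spirit of the conciseness
# postprocessing described above. Not the actual list used by
# MaxActivationAndLogitsExplainer.
PADDING_PATTERNS = [
    r"^this (feature|latent|neuron) (activates on|fires on|relates to)\s*",
    r"^the pattern (is|appears to be)\s*",
    r"^tokens (related to|about)\s*",
]

def postprocess_explanation(raw: str) -> str:
    """Trim whitespace, quotes, and leading padding phrases from an explanation."""
    text = raw.strip().strip('."')
    for pattern in PADDING_PATTERNS:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    return text.strip().strip('."')

print(postprocess_explanation("This feature activates on cities."))  # -> cities
```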
Of course, this new auto-interp method has weaknesses too. Because it aggressively aims for conciseness and often looks at lists of tokens instead of their full context, it may perform worse at finding subtle patterns that occur over longer texts. However, we have not yet found an example where this weakness is evident - let us know if you do.
Neuronpedia enforces a default API rate limit on endpoints to prevent single users from overwhelming the service. However, we recognize that this may not be sufficient for specific batches of research.
Now, Neuronpedia supports a higher tier API limit for specific accounts, which we have enabled for a few researchers. We do not charge for this service - if you wish to be whitelisted for the higher tier, just email us and we'll enable it for you for a specific timeframe.
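If you're batching requests under the default limit, a standard client-side pattern is to retry rate-limited (HTTP 429) responses with exponential backoff and jitter. This is generic client code, not Neuronpedia-specific behavior:

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-indexed).

    Full-jitter exponential backoff: pick uniformly between 0 and the
    capped exponential bound, which spreads out retries from many clients.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Example: delays grow (on average) as 1s, 2s, 4s, ... capped at 60s.
delays = [backoff_delay(attempt) for attempt in range(5)]
```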
Thanks to the community contributors for these recent PRs to Neuronpedia. Apologies for the backlog of reviewing PRs - we have been busy with the graphs release, and will get to them very soon!
zazer0: Optimizing inference space usage -
When running Neuronpedia in Docker, models would be downloaded into the container even if they were already present in your local Hugging Face cache. This PR adds a USE_LOCAL_HF_CACHE flag to re-use your locally cached models.
anthonyduong9: tokenize endpoint + more - New inference endpoint /tokenize returns an array of token strings for a given text. Also added Codecov and cleaned up some Python code.
shayansadeghieh: Inference integration tests - Merged some initial integration tests for inference server - more tests under review.
We had a blast working with these incredible orgs/people recently, and would like to express our highest thanks:
And thanks to you, the interpretability researcher and supporter, for your interest in moving forward the field of understanding AI internals - we recently hit a record of 100,000 API calls/day! 🤯
There are a few things we didn't mention in this newsletter because it was getting a bit long, and some things could use a little longer to fully bake. But as a quick sneak peek - we are quite excited about an open source library for Cross Layer Transcoders, expanding Neuronpedia's usage in a big way, and an obvious project that's been cooking for a little bit... 🌟💻
As always, please contact us with your questions, feedback, and suggestions.