
Circuit Tracer + New Auto-Interp Method
In This Edition
🔊 Summary by NotebookLM - The Babble on [Apple] [Spotify]
- Circuit Tracer: Our first collaboration with Anthropic and Anthropic Fellows, providing open source, on-demand generation/uploads/hosting of attribution graphs - a way to visualize and trace the internal reasoning of a model. [Try It] [Post] [Youtube] [Notebook] [API] [Github]
- New Auto-Interp Method: We devise and open source a new auto-interp method which is more concise, incorporates top logits, and is better at identifying "say [next token/pattern]" features. [Example] [Python] [Typescript]
- Updates + Community: Whitelisting for higher API limits, and community contributions/PRs to Neuronpedia.
- Acknowledgements + Upcoming: Giving credit to our excellent collaborators, and a preview of our next releases.
New: Circuit Tracer
[Try It] [Youtube - Latent Space] [Post]
Anthropic's Circuit Tracing paper demonstrated a way to visualize and trace the internal reasoning process of a model, so you can see how it arrived at its next token. We partnered with Anthropic as well as Anthropic Fellows to enable these on Neuronpedia.
A long-form video explainer of Circuit Tracer, which includes a demo of our integration, is featured on the Latent Space Youtube episode, The Utility of Interpretability.
Generating Graphs
[Example] [Library - Notebook] [API] [Github]
In collaboration with Anthropic and Anthropic Fellows Michael Hanna and Mateusz Piotrowski, authors of circuit-tracer, you can now instantly generate these attribution graphs on Neuronpedia. Almost 5,000 graphs have been generated by users so far. 🤯
- Custom Prompts: Type in your own text, click "Generate Graph", and a few seconds later, you'll get your own sharable attribution graph. You can also configure advanced generation and pruning settings.
- Feature Details: Feature/Latent descriptions are fully integrated with Neuronpedia - just hover or click on a feature/latent.
- Save + Share Graphs and Subgraphs: Graphs and subgraphs can be shared simply by copy pasting the url. You can also save/load subgraphs with the respective Save and Load buttons.
- API + Library: It's just one line to generate a graph using the Python Neuronpedia library (example notebook). Or if you want to use the API, check out the docs.
Currently, we support on-demand graph generation for Gemma-2-2B using the Gemmascope transcoders, but we plan to add support for more models soon.
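As a rough sketch of what a programmatic graph request might look like: the endpoint path and field names below are illustrative assumptions, not the documented schema - consult the API docs and example notebook linked above for the real interface.

```python
import json

# Hypothetical sketch of preparing a Neuronpedia graph-generation request.
# The endpoint and payload field names are assumptions for illustration.
API_URL = "https://www.neuronpedia.org/api/graph/generate"  # assumed endpoint

def build_graph_request(prompt: str, model_id: str = "gemma-2-2b",
                        max_feature_nodes: int = 5000) -> dict:
    """Build a JSON payload for an on-demand attribution graph request."""
    if not prompt.strip():
        raise ValueError("prompt must be non-empty")
    return {
        "prompt": prompt,                      # the text to trace
        "modelId": model_id,                   # Gemma-2-2B is currently supported
        "maxFeatureNodes": max_feature_nodes,  # pruning setting (assumed name)
    }

payload = build_graph_request("The capital of France is")
print(json.dumps(payload, indent=2))
```

In practice you would POST this payload to the API (or skip all of this and use the one-line helper in the Python Neuronpedia library).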
Uploading Graphs
[JSON Schema + Validator] [Library - Notebook] [API]
As a researcher, you may want to use a different method of generating graphs, or a different model. Neuronpedia fully supports uploading your own graphs, whether you use the circuit-tracer library, your own library, or different models/transcoders. Once uploaded, you get all the same functionality as graphs generated on Neuronpedia: creating subgraphs, sharing, viewing feature details, and more.
We provide a Python Notebook and API for uploading graphs. Some example uploaded graphs: A GPT2-Small graph from Goodfire researchers, and a Gelu-4L graph from EleutherAI.
Before uploading your graph, you'll want to pay special attention to validating your graph file:
Validating Graphs
In order for Neuronpedia to properly save and display your attribution graphs, we need to ensure they're in a compatible format. We created JSON schemas for both the graph file and the feature details file, based on Anthropic's original data formats.
Once you've generated a JSON graph, paste your graph JSON into the validator, and it will show you errors, missing fields, and additional optional fields you can add.
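To illustrate the kind of checks the validator performs, here is a toy stand-in using only the standard library. The required field names are illustrative assumptions - use Neuronpedia's actual JSON schema and validator for real uploads.

```python
import json

# Toy validator sketch: checks a few plausible top-level fields of a graph
# file. The field names in REQUIRED_FIELDS are assumptions for illustration,
# not Neuronpedia's real schema.
REQUIRED_FIELDS = {"metadata", "nodes", "links"}

def find_graph_errors(graph_json: str) -> list[str]:
    """Return a list of human-readable problems found in a graph JSON string."""
    try:
        graph = json.loads(graph_json)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    if not isinstance(graph, dict):
        return ["top level must be a JSON object"]
    return [f"missing required field: {f}"
            for f in sorted(REQUIRED_FIELDS - graph.keys())]

print(find_graph_errors('{"nodes": [], "links": []}'))
# → ['missing required field: metadata']
```

The real validator goes further, also reporting optional fields you can add to enrich the graph display.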
Featured Graphs + Subgraphs
By default, we feature the graphs and subgraphs from the circuit-tracer repository for Gemma-2-2b, and the graphs from the Anthropic paper for Claude Haiku 3.5 - these graphs are shown in the public dropdowns. If you want to have your graph featured, let us know via email or Slack.
New: Concise "Say _" AutoInterp Method
[Explanation Page] [Python] [Typescript]
We built and ran a new auto-interp method to solve some problems with our existing one. You can use `np_max-act-logits` now, either through the Neuronpedia site, our API, or directly in our `automated-interpretability` library (`MaxActivationAndLogitsExplainer`).

Here, we use the new auto-interp method `np_max-act-logits` to get a much more concise explanation.
Background
Auto-interp is a method of using LLMs to label features/latents with what they appear to be doing. By labeling features, we can quickly understand a latent without having to read its top activations or top logits. It's not perfect - there are many cases where auto-interp fails to accurately label a feature.
Advantages of the New Method
Our new auto-interp method has the following advantages over the previous primary auto-interp method used on Neuronpedia (`oai_token-act-pair`, based on Bills et al.):
- Top Positive Logits
  - Problem: The top positive logits for a feature are often a good indicator of what the feature/latent is about and what it is predicting will come next. However, most auto-interp methods do not show the model the top positive logits, only the top activations.
  - Solution: `np_max-act-logits` shows the top positive logits to the model and asks it to find a pattern (with some hinting and examples).
  - Example: Feature 21-gemmascope-transcoder-16k:5943, when using the old auto-interp method `oai_token-act-pair`, is explained as a mix of locations, political words, and code. Using `np_max-act-logits`, the model simply looks at the top logits, identifies them all as cities, then cleanly responds with one word: "cities".
- "Say [token or pattern]"
- Problem: Often times a feature is not about its top activating tokens, but is instead very clearly predicting a specific token or pattern in the next immediate token after the top activating token. We call these "say _" features. Existing autointerp methods mostly take into account the top activating token, so they usually fail at noticing "say _" features.
- Solution:
np_max-act-logits
shows the model a separate list of tokens that appear immediately after the top activating token, and asks the model to identify a pattern. If a pattern is found, then it's asked to format its response simply as say [the pattern]. - Example: Feature 20-gemmascope-transcoder-16k:14110, when using the old auto-interp method, even Claude 3.7 Sonnet is unable to identify the pattern and return a general response of "citation numbers".
np_max-act-logits
correctly identifies every token after the top activating token as the number "8", and it provides the explanation "say 8". Note that the top positive logits were not useful here - they have nothing to do with the number 8.
- Concise
  - Problem: LLMs are fine-tuned to output full sentences in order to chat with humans. This is not ideal for auto-interp, where we just want a short, concise answer. If the top activating token for all sampled texts is the word "story", previous auto-interp methods would usually give an answer with unnecessary and repetitive words, like "tokens related to the word 'story'". This makes the answer harder to read quickly, and takes up valuable UI space in already-packed interfaces like the newly released graphs above.
  - Solution: `np_max-act-logits` asks the model to be concise and provides explicit examples of "padding" phrasing to avoid. We also show examples of concise answers, and postprocess the answer after the model returns it.
  - Example: Feature 21-gemmascope-transcoder-16k:11060, with the old auto-interp method, takes 23 words to describe the feature in unnecessary and somewhat inaccurate detail. `np_max-act-logits` simply explains it as "sounds".
The full implementation is open source and available in our `automated-interpretability` repository as the `MaxActivationAndLogitsExplainer`. It's also reimplemented in Typescript in the webapp. We ran this new auto-interp method for layers 16 to 25 of the Gemma-2-2B Gemmascope transcoders, and its explanations are shown by default when hovering over features in attribution graphs.
Of course, this new auto-interp method has weaknesses too. Because it is aggressive about conciseness and often looks at lists of tokens instead of their full context, it may perform worse at finding subtle patterns that occur over longer texts. However, we have not yet found an example where this weakness is evident - let us know if you do.
Updates + Community Contributions
Higher API limits - Whitelist
Neuronpedia enforces a default API rate limit on endpoints so that no single user can overwhelm the service. However, we recognize that this may not be sufficient for certain research workloads.
Neuronpedia now supports a higher-tier API limit for specific accounts, which we have enabled for a few researchers. We do not charge for this service - if you wish to be whitelisted for the higher tier, just email us and we'll enable it for you for a specific timeframe.
Community Contributions
Thanks to the community contributors for these recent PRs to Neuronpedia. Apologies for the backlog of reviewing PRs - we have been busy with the graphs release, and will get to them very soon!
- zazer0: Optimizing inference space usage - When running Neuronpedia in Docker, models would be downloaded into the container even if you already had the model in your local HF cache. This adds the flag `USE_LOCAL_HF_CACHE` to re-use your locally cached models.
- anthonyduong9: tokenize endpoint + more - New inference endpoint /tokenize returns a tokenized array of strings for a text. Also added Codecov and cleaned up some Python code.
- shayansadeghieh: Inference integration tests - Merged some initial integration tests for the inference server - more tests under review.
Acknowledgements + Upcoming
We had a blast working with these incredible orgs/people recently, and would like to express our highest thanks:
- Anthropic - Emmanuel Ameisen, Jack Lindsey, Joshua Batson
- Anthropic Fellows - Michael Hanna, Mateusz Piotrowski
- EleutherAI (Gonçalo Paulo, Stepan Shabalin) and Goodfire (Max Loeffler)
And thanks to you, the interpretability researcher and supporter, for your interest in moving forward the field of understanding AI internals - we recently hit a record of 100,000 API calls/day! 🤯
Upcoming
There are a few things we didn't mention in this newsletter because it was getting a bit long, and some things could use a bit longer to fully bake. But as a quick sneak peek - we are quite excited about an open source library for Cross Layer Transcoders, expanding Neuronpedia's usage in a big way, and an obvious project that's been cooking for a little bit... 🌟💻
As always, please contact us with your questions, feedback, and suggestions.