EXPLANATION TYPE
np_token-act-pair-logits
Description
OpenAI's Automated Interpretability, from the paper "Language models can explain neurons in language models". Modified by Johnny Lin to add new models and context windows. Further modifications (May 2025): the explainer model is shown the top positive logits, and is asked to be more concise and to omit filler such as "phrases related to...".
Author
OpenAI
URL
https://github.com/hijohnnylin/automated-interpretability
Settings
Modified version of OpenAI's token-activation-pair explanation method. Modifications: the explainer model is shown the top positive logits, and is asked to be more concise and to omit filler such as "phrases related to...".
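
The idea described above can be illustrated with a minimal sketch of how such a prompt might be assembled. This is not the code from the linked repository: the function names, the prompt wording, and the 0-10 integer scaling of activations are all assumptions made for illustration.

```python
from typing import Sequence


def normalize_activations(activations: Sequence[float], scale: int = 10) -> list[int]:
    """Map raw activations to integers in [0, scale]; negatives clip to 0.

    The 0-10 scale is an assumption chosen for this sketch.
    """
    peak = max(activations, default=0.0)
    if peak <= 0:
        return [0 for _ in activations]
    return [min(scale, max(0, round(scale * a / peak))) for a in activations]


def build_prompt(
    tokens: Sequence[str],
    activations: Sequence[float],
    top_positive_logits: Sequence[str],
) -> str:
    """Build a hypothetical explainer prompt from token-activation pairs
    plus the top positive logits. The wording is illustrative only."""
    if len(tokens) != len(activations):
        raise ValueError("tokens and activations must have the same length")
    levels = normalize_activations(activations)
    # One "token<TAB>activation" pair per line.
    pair_lines = [f"{tok}\t{lvl}" for tok, lvl in zip(tokens, levels)]
    return "\n".join(
        [
            "Token-activation pairs:",
            *pair_lines,
            "",
            "Top positive logits: " + ", ".join(top_positive_logits),
            "",
            "Explain concisely what this neuron responds to. "
            'Omit filler such as "phrases related to...".',
        ]
    )


if __name__ == "__main__":
    print(build_prompt(["the", "cat", "sat"], [0.0, 2.0, 1.0], ["dog", "kitten"]))
```

In this sketch the top positive logits are appended after the token-activation pairs so the explainer model sees both what activates the neuron and which output tokens it promotes. The ordering is a design choice of the example, not a documented detail of the method.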