© Neuronpedia 2026

Neuronpedia

GPT2-Small
8-TRES-DC (Transcoders Residuals)
Index 185

Explanations

instances of the word "you" indicating direct address to the audience

oai_token-act-pair · gpt-4o-mini Triggered by @bot

Negative Logits

Token           Logit
" admitting"    -0.66
" thinks"       -0.63
" wors"         -0.63
"agra"          -0.60
" considers"    -0.60
"Roh"           -0.59
"版"            -0.59
"ˈ"             -0.59
"obbies"        -0.59
"forcement"     -0.59

Positive Logits

Token           Logit
"worth"         0.83
" MUST"         0.65
" benefit"      0.65
" Tube"         0.64
"vill"          0.60
"tle"           0.59
"worldly"       0.58
"vc"            0.58
"CAN"           0.57
"ittal"         0.57

Activations Density 0.092%

No Known Activations