Neuronpedia



Explanations

words followed by "to"

np_acts-logits-general · gemini-2.5-flash-lite


Configuration

google/gemma-scope-2-27b-it/resid_post/layer_53_width_262k_l0_medium

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1
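
The source identifier in the configuration packs several settings into one path. As a rough sketch, it can be split into parts like this. The field names (`org`, `release`, `hook_point`, `layer`, `width`, `l0`) are labels assumed here for illustration, not names taken from the page, and the pattern is only checked against this one identifier.

```python
import re

# Identifier copied verbatim from the Configuration section.
SOURCE_ID = "google/gemma-scope-2-27b-it/resid_post/layer_53_width_262k_l0_medium"


def parse_source_id(source_id: str) -> dict:
    """Split a source identifier into labelled parts.

    The labels are assumptions made for this sketch; the page itself
    does not name these fields.
    """
    org, release, hook_point, sae = source_id.split("/")
    match = re.fullmatch(r"layer_(\d+)_width_(\w+?)_l0_(\w+)", sae)
    if match is None:
        raise ValueError(f"unrecognised SAE segment: {sae!r}")
    return {
        "org": org,
        "release": release,
        "hook_point": hook_point,
        "layer": int(match.group(1)),
        "width": match.group(2),
        "l0": match.group(3),
    }


parsed = parse_source_id(SOURCE_ID)
print(parsed)
```

Read this way, the feature comes from a dictionary attached to the residual stream after layer 53, with a nominal width of 262k and a "medium" sparsity setting.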


Negative Logits

鳎  0.38
스러운  0.37
童  0.35
ratingBarStyle  0.34
神経  0.34
 വേണ്ടി  0.34
 পক্ষে  0.33
ളെ  0.33
lefthar  0.33
(blank token)  0.33

Positive Logits

To  1.88
to  1.86
To  1.73
to  1.44
TO  1.20
 तो  1.12
 то  1.12
TO  1.12
ToServer  0.97
ToAction  0.96
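
The positive-logit tokens above can be checked against the explanation "words followed by "to"" with a small tally. This is a minimal sketch: the token/value pairs are copied from the list as displayed (duplicates such as "To" may differ only in whitespace that the page does not show), and the three category names are made up for this example.

```python
from collections import Counter

# Top positive logits as displayed on the page (token, value).
positive_logits = [
    ("To", 1.88), ("to", 1.86), ("To", 1.73), ("to", 1.44), ("TO", 1.20),
    (" तो", 1.12), (" то", 1.12), ("TO", 1.12),
    ("ToServer", 0.97), ("ToAction", 0.96),
]


def classify(token: str) -> str:
    """Bucket a token by how it relates to the Latin-script string 'to'."""
    norm = token.strip().casefold()
    if norm == "to":
        return "exact"       # "to" in some capitalisation
    if norm.startswith("to"):
        return "prefix"      # identifiers that begin with "To"
    return "other"           # non-Latin tokens (here Devanagari and Cyrillic)


counts = Counter(classify(token) for token, _ in positive_logits)
print(counts)
```

Six of the ten tokens are case variants of "to", two are identifiers beginning with "To", and the remaining two are non-Latin tokens that both transliterate roughly as "to".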

Activations Density 0.084%
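
To put the density figure in context, the following arithmetic sketch estimates how many tokens the feature would fire on. It assumes that the density is the fraction of tokens with a nonzero activation, that it was measured over the dashboard prompts listed in the configuration, and that every prompt has the full 512 tokens. None of these assumptions is stated on the page, so the result is an order-of-magnitude estimate only.

```python
# Figures copied from the page.
num_prompts = 238_145
tokens_per_prompt = 512
density = 0.084 / 100  # 0.084% expressed as a fraction

# Upper bound on tokens seen, assuming every prompt is full length.
total_tokens = num_prompts * tokens_per_prompt

# Estimated number of tokens with a nonzero activation under the assumptions above.
expected_active = total_tokens * density

print(f"total tokens: {total_tokens:,}")
print(f"estimated active tokens: {expected_active:,.0f}")
```

Under these assumptions the feature would be active on roughly 100,000 of about 122 million tokens.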

No Known Activations

© Neuronpedia 2025
