INDEX

Explanations

be

np_max-act · gemini-2.0-flash

Same activations, but with all zeros filtered out:

<start> replete 0.23706 with 0.32593 includes 0.35425 tell 0.32397 Describe 1.29102 choosing 1.25488 paying 1.14453 special 1.08691 should 1.19336 unique 0.28271 not 0.77099 just 0.57275 cookie 0.27026 cutter 0.20691 description 0.49780 make 0.21655 sure 2.03711 describe 1.30273 keep 0.78613 moving 0.81348 forwards 0.59033 … <end>

Explanation of neuron 4 behavior: the main thing this neuron does is find directive or imperative instruction words in the user’s prompt.
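The zero-filtered listing can be illustrated with a minimal sketch. It assumes the input is a pair of parallel lists (tokens and their activations) and that the output is the space-separated `token value` format shown here; the function name `format_nonzero_activations` and the zero-activation tokens in the example ("Please", "to") are hypothetical, not taken from the dashboard.

```python
from typing import Sequence


def format_nonzero_activations(tokens: Sequence[str], activations: Sequence[float]) -> str:
    """Render token/activation pairs, skipping tokens whose activation is exactly zero.

    Hypothetical helper: the real pipeline that produced the listing is not shown
    in the source, so the formatting details here are assumptions.
    """
    if len(tokens) != len(activations):
        raise ValueError("tokens and activations must have the same length")
    # Keep only tokens that fired, formatted with five decimals like the listing.
    pairs = [f"{tok} {act:.5f}" for tok, act in zip(tokens, activations) if act != 0.0]
    return " ".join(["<start>", *pairs, "<end>"])


# Toy input: "Please" and "to" are made-up tokens with zero activation.
example = format_nonzero_activations(
    ["Please", "make", "sure", "to", "describe"],
    [0.0, 0.21655, 2.03711, 0.0, 1.30273],
)
print(example)
```

Dropping zeros shortens the prompt considerably when a feature is sparse, at the cost of losing the positions of the tokens that did not fire.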

oai_token-act-pair · o4-mini Triggered by @xinyanhu8

Configuration

andyrdt/saes-llama-3.1-8b-instruct/resid_post_layer_11/trainer_1

Dataset (Dashboard): Various
Features: 131,072
Data Type: float32
Hook Name: blocks.11.hook_resid_post
Architecture: standard
Context Size: 1,024
Dataset: monology/pile-uncopyrighted

Negative Logits

" leaks"       -0.07
"around"       -0.07
" evidently"   -0.07
"enc"          -0.07
" unpleasant"  -0.06
" bleak"       -0.06
"MEA"          -0.06
"_by"          -0.06
" Bound"       -0.06
" smokers"     -0.06

Positive Logits

"設備"            0.07
" putas"         0.07
".pt"            0.06
" εκεί"          0.06
" haar"          0.06
"्रश"            0.06
"取消"            0.06
"學校"            0.06
".handleChange"  0.06

Activations Density 0.067%
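As a rough illustration of what a density figure like this means, the sketch below treats it as the fraction of tokens on which the feature's activation exceeds a threshold. That definition, the name `activation_density`, and the default threshold of zero are assumptions; the dashboard's exact computation is not given in the source.

```python
from typing import Iterable


def activation_density(activations: Iterable[float], threshold: float = 0.0) -> float:
    """Fraction of tokens whose activation is strictly above the threshold.

    Hypothetical definition for illustration; the dashboard may sample or
    threshold differently.
    """
    total = 0
    active = 0
    for act in activations:
        total += 1
        if act > threshold:
            active += 1
    if total == 0:
        raise ValueError("need at least one activation to compute a density")
    return active / total


# Toy data: the feature fires on 1 token out of 1,500.
toy_activations = [0.0] * 1499 + [1.2]
density = activation_density(toy_activations)
print(f"{density:.3%}")
```

A density well under 1% is typical of a sparse feature: it stays silent on nearly all tokens.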

be

No Comments

No Known Activations
