Explanations

negation words

np_max-act · gemini-2.0-flash

The neuron activates on words that signal comparison or contrast (e.g. than, less, rather, only, not) or emphasize degree in describing trade-offs.

oai_token-act-pair · o4-mini Triggered by @xinyanhu8

Configuration

andyrdt/saes-llama-3.1-8b-instruct/resid_post_layer_11/trainer_1

Dataset (Dashboard): Various

Features: 131,072

Data Type: float32

Hook Name: blocks.11.hook_resid_post

Architecture: standard

Context Size: 1,024

Dataset: monology/pile-uncopyrighted
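
For orientation, the configuration above describes a "standard" architecture sparse autoencoder (SAE) with 131,072 features reading the residual stream at blocks.11.hook_resid_post. The following is a minimal sketch of what such an SAE typically computes, assuming the common ReLU formulation with a decoder bias subtracted from the input. Implementations differ on that last detail, and the toy weights, dimensions, and function names here are hypothetical and not taken from this SAE.

```python
# Toy sketch of a standard (ReLU) sparse autoencoder forward pass.
# Assumption: features = ReLU(W_enc @ (x - b_dec) + b_enc),
#             reconstruction = features @ W_dec + b_dec.
# Real dimensions would be d_model x 131,072; here d_model=2, n_features=3.

def relu(values):
    return [max(0.0, v) for v in values]

def sae_encode(x, W_enc, b_enc, b_dec):
    """Map a residual-stream vector x to non-negative feature activations."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre_acts = [
        sum(w * c for w, c in zip(row, centered)) + b
        for row, b in zip(W_enc, b_enc)
    ]
    return relu(pre_acts)

def sae_decode(features, W_dec, b_dec):
    """Rebuild the residual-stream vector as a weighted sum of decoder rows."""
    out = list(b_dec)
    for f, row in zip(features, W_dec):
        for j, w in enumerate(row):
            out[j] += f * w
    return out

# Hypothetical toy parameters (one encoder row and one decoder row per feature).
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0], [0.0, 1.0], [-0.5, -0.5]]
b_dec = [0.0, 0.0]

x = [2.0, -1.0]
features = sae_encode(x, W_enc, b_enc, b_dec)
reconstruction = sae_decode(features, W_dec, b_dec)
```

In this toy case only the first feature fires, which illustrates the sparsity that the low activation density on this page reflects.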

Negative Logits

 tart        -0.08
Nd           -0.07
SEM          -0.06
_types       -0.06
OCR          -0.06
 Tricks      -0.06
 stab        -0.06
Compile      -0.06
 kart        -0.06
Rut          -0.06

Positive Logits

倍            0.07
">--}}↵      0.07
 hotelu      0.06
"};↵         0.06
quota        0.06
/system      0.06
 안내          0.06
 getLast     0.06
 DISCLAIMER  0.06
Available    0.06

Activations Density 0.060%
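
As a rough illustration of the density figure, the sketch below assumes that activation density means the fraction of sampled tokens on which the feature has a nonzero activation. The page does not define the term, and the activation values and function name are hypothetical.

```python
# Toy sketch: activation density as the share of tokens where the feature fires.
# Assumption: a token counts as "active" when its activation is greater than zero.

def activation_density(activations):
    """Fraction of tokens with a strictly positive feature activation."""
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > 0.0)
    return active / len(activations)

# Hypothetical sample: 10,000 tokens, 6 of which activate the feature.
sample = [0.0] * 10_000
for position in (12, 480, 1_999, 4_321, 7_000, 9_998):
    sample[position] = 1.5

density = activation_density(sample)
formatted = f"{density:.3%}"
```

Under this definition, a density of 0.060% corresponds to the feature firing on about 6 of every 10,000 tokens.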

No Known Activations