INDEX

Explanations

oai_token-act-pair · gemini-2.0-flash: words, symbols, or word fragments related to negation or opposition, along with a seemingly random assortment of other terms.

np_max-act-logits · gemini-2.0-flash: non

Configuration

google/gemma-scope-2b-pt-transcoders/layer_24/width_16k/average_l0_37

Prompts (Dashboard): 24,576 prompts, 128 tokens each

Dataset (Dashboard): monology/pile-uncopyrighted

Features: 16,384

Data Type: float32

Hook Name: blocks.24.ln2.hook_normalized

Architecture: jumprelu_transcoder

Context Size: 1,024

Dataset: monology/pile-uncopyrighted
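
The architecture listed above, jumprelu_transcoder, can be illustrated with a toy forward pass. This is a minimal sketch under stated assumptions, not the code behind this dashboard: it assumes the usual reading of a transcoder (encode the MLP input into sparse features, decode a prediction of the MLP output) and of JumpReLU (a pre-activation is kept only if it exceeds a per-feature threshold). The function names, weights, and thresholds are all hypothetical toy values.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def jumprelu(pre_acts: Vector, thresholds: Vector) -> Vector:
    """Keep each pre-activation only where it exceeds its feature's threshold."""
    return [z if z > t else 0.0 for z, t in zip(pre_acts, thresholds)]


def matvec(matrix: Matrix, vec: Vector) -> Vector:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def transcoder_forward(
    x: Vector,
    w_enc: Matrix,      # hypothetical encoder weights, one row per feature
    b_enc: Vector,
    thresholds: Vector,
    w_dec: Matrix,      # hypothetical decoder weights, one row per feature
    b_dec: Vector,
) -> Tuple[Vector, Vector]:
    """Encode x into sparse features, then decode a predicted output vector."""
    pre = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    feats = jumprelu(pre, thresholds)
    out = list(b_dec)
    for f, row in zip(feats, w_dec):
        if f != 0.0:  # only active features contribute to the output
            for j, w in enumerate(row):
                out[j] += f * w
    return feats, out


# Toy example: 2-dimensional input, 3 features (the real model lists 16,384).
x = [1.0, 0.3]
w_enc = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b_enc = [0.0, 0.0, 0.0]
thresholds = [0.5, 0.5, 0.5]
w_dec = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_dec = [0.0, 0.0]

feats, out = transcoder_forward(x, w_enc, b_enc, thresholds, w_dec, b_dec)
print(feats, out)
```

In the toy example the second feature has a positive pre-activation (0.3) that falls below its threshold, so JumpReLU zeroes it where a plain ReLU would have kept it.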

Negative Logits

(token not shown)    -1.41
the                  -1.24
(token not shown)    -1.14
tattoo               -1.08
tone                 -1.05
tory                 -1.05
tl                   -1.05
tren                 -1.00
tur                  -0.99
tan                  -0.98

Positive Logits

й                    0.41
ted                  0.40
di                   0.40
ات                   0.40
able                 0.39
cibly                0.39
<h5>                 0.38
cerr                 0.37
nnn                  0.37
(token not shown)    0.36

Activations Density 6.813%
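
As a rough sanity check on the density figure, the sketch below works out what 6.813% would mean in token counts. It assumes, without confirmation from this page, that density is the fraction of token positions in the dashboard prompts (24,576 prompts of 128 tokens each) where the feature has a nonzero activation. The helper `activation_density` is a hypothetical illustration of that definition.

```python
from typing import Sequence

# Figures taken from the configuration on this page.
n_prompts = 24_576
tokens_per_prompt = 128
reported_density = 0.06813

total_tokens = n_prompts * tokens_per_prompt
approx_active_tokens = round(total_tokens * reported_density)


def activation_density(acts: Sequence[float]) -> float:
    """Fraction of positions with a strictly positive activation (assumed definition)."""
    if not acts:
        return 0.0
    return sum(1 for a in acts if a > 0) / len(acts)


print(total_tokens, approx_active_tokens)
print(activation_density([0.0, 0.0, 1.5, 0.0]))
```

Under that assumption the feature would fire on roughly one token in fifteen, which is comparatively dense for a sparse feature and fits the broad, loosely defined explanation given for it.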
