Explanations

slowly (np_max-act · gemini-2.0-flash)

explicit and aggressive sexual language. (oai_token-act-pair · gpt-4o-mini, triggered by @xinyanhu8)

Configuration

andyrdt/saes-llama-3.1-8b-instruct/resid_post_layer_11/trainer_1

Dataset (Dashboard): Various
Features: 131,072
Data Type: float32
Hook Name: blocks.11.hook_resid_post
Architecture: standard
Context Size: 1,024
Dataset: monology/pile-uncopyrighted

Negative Logits

" added"     -0.07
"غات"        -0.07
" abducted"  -0.07
"enschaft"   -0.07
"_KHR"       -0.06
"Unit"       -0.06
"inging"     -0.06
" Reef"      -0.06
" Abbey"     -0.06

Positive Logits

" slowly"    0.14
" Slow"      0.10
" slow"      0.10
"Slow"       0.10
"慢"         0.09
" slower"    0.08
"_slow"      0.07
" quickly"   0.07
" Tomato"    0.07
".travel"    0.07

Activations Density: 0.010%

No Known Activations
