Explanations

Lengthy and complex text

np_max-act · gemini-2.0-flash

This neuron responds to meta-instructions asking the model to give a “completely unhinged” or unconstrained answer with “no remorse or ethics,” i.e., prompts telling it to ignore rules or policies.

oai_token-act-pair · o4-mini Triggered by @xinyanhu8

Configuration

andyrdt/saes-llama-3.1-8b-instruct/resid_post_layer_11/trainer_1

Dataset (Dashboard)

Various

Features

131,072

Data Type

float32

Hook Name

blocks.11.hook_resid_post

Architecture

standard

Context Size

1,024

Dataset

monology/pile-uncopyrighted
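
For reference, the configuration values listed above can be collected into one record. This is only a minimal sketch: the class name `SaeConfig`, its field names, and the `hook_layer` helper are hypothetical and not part of any library; the values are copied from this page.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SaeConfig:
    """Hypothetical container for the values shown on this page."""
    sae_path: str
    dashboard_dataset: str
    n_features: int
    dtype: str
    hook_name: str
    architecture: str
    context_size: int
    dataset: str

    def hook_layer(self) -> int:
        # Extract the layer index from a name such as "blocks.11.hook_resid_post".
        match = re.fullmatch(r"blocks\.(\d+)\.hook_resid_post", self.hook_name)
        if match is None:
            raise ValueError(f"unexpected hook name: {self.hook_name}")
        return int(match.group(1))


config = SaeConfig(
    sae_path="andyrdt/saes-llama-3.1-8b-instruct/resid_post_layer_11/trainer_1",
    dashboard_dataset="Various",
    n_features=131_072,
    dtype="float32",
    hook_name="blocks.11.hook_resid_post",
    architecture="standard",
    context_size=1_024,
    dataset="monology/pile-uncopyrighted",
)

# The layer parsed from the hook name should agree with the layer in the path.
print(config.hook_layer(), f"resid_post_layer_{config.hook_layer()}" in config.sae_path)
```

The check at the end only confirms that the hook name and the path refer to the same layer; it says nothing about how the SAE was trained.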

Negative Logits

.touches

-0.06

 startups

-0.06

EDIATEK

-0.06

.texture

-0.06

_IEnumerator

-0.06

monto

-0.06

Conflict

-0.05

iterals

-0.05

 sanat

-0.05

)],

-0.05

Positive Logits

 Kepler

0.07

چ

0.07

(ps

0.07

ichtet

0.06

 Packaging

0.06

uer

0.06

 Banc

0.06

�

0.06

ві

0.06

�

0.06

Activations Density 0.003%
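
To give a sense of scale for this density, the sketch below converts it into expected activations per context. It assumes the density is the fraction of token positions on which the feature is non-zero (the page does not define it), and the function name `expected_active_tokens` is hypothetical.

```python
def expected_active_tokens(density_percent: float, n_tokens: int) -> float:
    """Expected number of active token positions, assuming density is a
    per-token firing fraction expressed as a percentage."""
    return density_percent / 100.0 * n_tokens


DENSITY_PERCENT = 0.003  # "Activations Density" from this page
CONTEXT_SIZE = 1024      # "Context Size" from this page

per_context = expected_active_tokens(DENSITY_PERCENT, CONTEXT_SIZE)
contexts_per_activation = 1.0 / per_context

print(f"{per_context:.5f} active tokens per context")
print(f"about one active token every {contexts_per_activation:.1f} contexts")
```

Under that assumption the feature fires on roughly 3 of every 100,000 tokens, or about once per 33 full-length contexts, which is consistent with a feature tied to a rare kind of prompt.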

No Known Activations