Explanations

abuse

np_max-act · gemini-2.0-flash

gaslighting abuse

np_max-act-logits · gemini-2.5-flash-lite · Triggered by @johnny

content related to gaslighting and emotional manipulation.

oai_token-act-pair · gpt-4o · Triggered by @tcai

Configuration

andyrdt/saes-llama-3.1-8b-instruct/resid_post_layer_11/trainer_1

Dataset (Dashboard): Various
Features: 131,072
Data Type: float32
Hook Name: blocks.11.hook_resid_post
Architecture: standard
Context Size: 1,024
Dataset: monology/pile-uncopyrighted

•Misaligned Persona

psychological manipulation

Negative Logits

" names"     -0.07
" quat"      -0.07
" montage"   -0.07
"\tValue"    -0.06
" nhưng"     -0.06
"팅"          -0.06
" lover"     -0.06
"jq"         -0.06
"apy"        -0.06
"\ttree"     -0.06

Positive Logits

" Avec"         0.07
" resourceId"   0.06
"(NULL"         0.06
" rezerv"       0.06
"inery"         0.06
"юн"            0.06
".train"        0.06
"DX"            0.06
"ikat"          0.06
"ForResource"   0.06

Activations Density 0.171%

No Known Activations