INDEX

Explanations

fragments of explicit sexual descriptions (oai_token-act-pair · gemini-2.0-flash)

pronouns referring to males (np_token-act-pair-logits · gpt-4o-mini)

his/him (np_max-act-logits · gemini-2.0-flash)

Configuration

google/gemma-scope-2b-pt-transcoders/layer_21/width_16k/average_l0_13

Prompts (Dashboard): 24,576 prompts, 128 tokens each
Dataset (Dashboard): monology/pile-uncopyrighted
Features: 16,384
Data Type: float32
Hook Name: blocks.21.ln2.hook_normalized
Architecture: jumprelu_transcoder
Context Size: 1,024
Dataset: monology/pile-uncopyrighted
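The architecture listed in the configuration above is jumprelu_transcoder. As a rough illustration of what a JumpReLU activation does, the sketch below zeroes every pre-activation that does not exceed a per-feature threshold and passes the rest through unchanged. The function names, the toy weights, and the threshold values are all hypothetical and chosen only for illustration; they are not taken from the released model.

```python
from typing import List, Sequence


def jumprelu(pre_acts: Sequence[float], thresholds: Sequence[float]) -> List[float]:
    """Keep each pre-activation only if it exceeds its feature's threshold."""
    return [z if z > t else 0.0 for z, t in zip(pre_acts, thresholds)]


def encode(x: Sequence[float],
           w_enc: Sequence[Sequence[float]],
           b_enc: Sequence[float],
           thresholds: Sequence[float]) -> List[float]:
    """Toy encoder: pre-activation = x @ w_enc + b_enc, then JumpReLU.

    w_enc holds one row per input dimension and one column per feature.
    """
    n_features = len(b_enc)
    pre_acts = [
        sum(x[i] * w_enc[i][j] for i in range(len(x))) + b_enc[j]
        for j in range(n_features)
    ]
    return jumprelu(pre_acts, thresholds)


# Toy example (assumed values): 2 input dimensions, 3 features.
# The configuration above lists 16,384 features for the real model.
x = [1.0, 2.0]
w_enc = [[0.5, 0.0, -1.0],
         [0.25, 0.1, 0.5]]
b_enc = [0.0, 0.0, 0.0]
thresholds = [0.5, 0.5, 0.5]

acts = encode(x, w_enc, b_enc, thresholds)
print(acts)
```

In this toy case the second feature has a small positive pre-activation (0.2) that a plain ReLU would pass through, while JumpReLU sets it to zero because it sits below the threshold.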

Negative Logits

"jej"  -0.80
"Jej"  -0.79
" její"  -0.77
"her"  -0.74
"she"  -0.72
"&=&"  -0.72
"her"  -0.66
" verla"  -0.66
"&=&\"  -0.64
"&=&"  -0.63

Positive Logits

"his"  2.92
"him"  2.45
"his"  2.33
"彼の"  2.30
"he"  2.30
" himself"  2.25
"彼は"  2.19
"彼が"  2.17
"himself"  1.97
" seiner"  1.93

Activations Density 6.811%
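To give the density figure above a concrete scale, the sketch below converts it into an approximate token count. It assumes, without confirmation from the page itself, that the density is the fraction of tokens on which the feature is active and that it was measured over the dashboard prompts listed in the configuration (24,576 prompts of 128 tokens each). The function name is hypothetical.

```python
def active_token_estimate(n_prompts: int, tokens_per_prompt: int,
                          density_percent: float) -> tuple:
    """Return (total tokens, estimated tokens where the feature is active)."""
    total = n_prompts * tokens_per_prompt
    # Assumption: density is the share of all tokens with a nonzero activation.
    active = round(total * density_percent / 100)
    return total, active


total_tokens, active_tokens = active_token_estimate(24_576, 128, 6.811)
print(total_tokens, active_tokens)
```

Under those assumptions the feature would be active on roughly 214 thousand of about 3.1 million tokens.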
