INDEX

Explanations

terminology related to inhumane treatment or categorization of individuals

oai_token-act-pair · gpt-3.5-turbo

references to human and non-human distinctions

oai_token-act-pair · gpt-4o-mini Triggered by @bot

Top Features by Cosine Similarity

Comparing With GPT2-SMALL @ 1-res-jb

Configuration

jbloom/GPT2-Small-SAEs-Reformatted/blocks.1.hook_resid_pre

Prompts (Dashboard)

24,576 prompts, 128 tokens each

Dataset (Dashboard)

Skylion007/openwebtext

Features

24,576

Data Type

torch.float32

Hook Point

blocks.1.hook_resid_pre

Architecture

standard

Context Size

128

Dataset

Skylion007/openwebtext

Hook Point Layer

Activation Function

relu
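
As a rough illustration of what the configuration above describes, here is a minimal sketch of a one-hidden-layer sparse autoencoder with a ReLU activation, applied to a residual-stream vector. The function names (`sae_encode`, `sae_decode`) and the toy weights are hypothetical, and the exact formulation of the released SAE (for example, whether the decoder bias is subtracted before encoding) is an assumption not confirmed by this page.

```python
# Sketch of a generic ReLU sparse autoencoder (SAE).
# Assumption: a plain encoder/decoder pair; the released weights may
# differ in details such as bias handling or weight normalization.


def sae_encode(resid, w_enc, b_enc):
    """Map a residual-stream vector to non-negative feature activations.

    w_enc holds one row per feature; each row has len(resid) weights.
    """
    return [
        max(0.0, sum(w * x for w, x in zip(row, resid)) + b)  # ReLU
        for row, b in zip(w_enc, b_enc)
    ]


def sae_decode(acts, w_dec, b_dec):
    """Reconstruct the residual vector from feature activations."""
    out = list(b_dec)
    for act, row in zip(acts, w_dec):
        if act == 0.0:
            continue  # inactive features contribute nothing
        for i, w in enumerate(row):
            out[i] += act * w
    return out


# Toy sizes for illustration only. The SAE on this page would use
# 768 inputs (GPT-2 small's residual width) and 24,576 features.
resid = [1.0, -2.0]
w_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, 0.5]
w_dec = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_dec = [0.5, 0.25]

acts = sae_encode(resid, w_enc, b_enc)
recon = sae_decode(acts, w_dec, b_dec)
print(acts, recon)
```

In this toy run only the first feature stays active after the ReLU, which is the sparsity behavior the architecture is meant to encourage.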

Negative Logits

 Rolls

-0.69

rypt

-0.69

CHAT

-0.69

 Decker

-0.68

Reb

-0.67

Roll

-0.63

 Channel

-0.63

 Mine

-0.63

 Clerk

-0.63

 Line

-0.61

Positive Logits

itarian

1.09

human

1.08

 beings

1.03

thood

0.94

ity

0.93

zee

0.89

theless

0.89

ciating

0.89

icity

0.89

humans

0.88
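
Logit tables like the two above are commonly produced by projecting a feature's decoder direction through the model's unembedding matrix and then listing the most negative and most positive tokens. The sketch below shows that idea on toy data; it assumes this is how the numbers were derived, which the page itself does not state, and the names `logit_effects` and `top_and_bottom` are hypothetical.

```python
# Sketch: rank vocabulary tokens by how much a feature's decoder
# direction raises or lowers their logits.
# Assumption: score = decoder_dir @ unembed; the dashboard's exact
# pipeline (e.g. any normalization step) is not confirmed here.


def logit_effects(decoder_dir, unembed, vocab):
    """Return (token, score) pairs sorted from most positive to most negative.

    unembed[i][j] is the weight from residual dimension i to token j.
    """
    scores = [
        sum(decoder_dir[i] * unembed[i][j] for i in range(len(decoder_dir)))
        for j in range(len(vocab))
    ]
    return sorted(zip(vocab, scores), key=lambda pair: pair[1], reverse=True)


def top_and_bottom(ranked, k):
    """Split a ranked list into the k most positive and k most negative entries."""
    positive = ranked[:k]
    negative = ranked[-k:][::-1]  # most negative first
    return positive, negative


# Toy three-token vocabulary and two-dimensional residual stream.
vocab = ["human", " Rolls", "ity"]
decoder_dir = [1.0, 0.0]
unembed = [
    [2.0, -1.0, 0.5],
    [9.0, 9.0, 9.0],  # ignored here because decoder_dir[1] is 0
]

ranked = logit_effects(decoder_dir, unembed, vocab)
positive, negative = top_and_bottom(ranked, 1)
print(positive, negative)
```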

Activations Density 0.009%
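
To put the density figure in context, the short calculation below estimates how many tokens the feature would fire on, assuming the density was measured over the dashboard prompts listed in the configuration (24,576 prompts of 128 tokens each). That assumption is not stated on the page, so treat the result as an order-of-magnitude estimate.

```python
# Sketch: convert an activation density into an approximate token count.
# Assumption: density is the fraction of dashboard tokens on which the
# feature has a non-zero activation.

N_PROMPTS = 24_576
TOKENS_PER_PROMPT = 128
DENSITY_PERCENT = 0.009

total_tokens = N_PROMPTS * TOKENS_PER_PROMPT
expected_active = total_tokens * DENSITY_PERCENT / 100

print(f"{total_tokens:,} tokens in total")
print(f"about {round(expected_active)} tokens with a non-zero activation")
```

Under that assumption the feature is active on roughly 283 of about 3.1 million tokens.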

No Known Activations