Explanations

phrases related to rewards or benefits

oai_token-act-pair · gpt-3.5-turbo

terms related to rewards and incentives

oai_token-act-pair · gpt-4o-mini · Triggered by @bot

Top Features by Cosine Similarity

Comparing With GPT2-SMALL @ 0-res-jb

Configuration

jbloom/GPT2-Small-SAEs-Reformatted/blocks.0.hook_resid_pre

Prompts (Dashboard): 24,576 prompts, 128 tokens each
Dataset (Dashboard): Skylion007/openwebtext
Features: 24,576
Data Type: torch.float32
Hook Point: blocks.0.hook_resid_pre
Architecture: standard
Context Size: 128
Dataset: Skylion007/openwebtext
Hook Point Layer:
Activation Function: relu
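
As a rough illustration of what a "standard" architecture with a relu activation usually means for a sparse autoencoder, here is a minimal sketch in plain Python. The class name TinySAE, the weight names, the toy sizes (8 inputs, 32 features), and the random initialisation are all assumptions made for the example; the dashboard's SAE has 24,576 features, and real implementations can differ in details such as bias handling.

```python
import random

def matvec(rows, vec):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in rows]

class TinySAE:
    """Toy sparse autoencoder: ReLU encoder, linear decoder.

    Hypothetical, scaled-down stand-in for illustration only.
    """

    def __init__(self, d_in, d_sae, seed=0):
        rng = random.Random(seed)
        # Encoder maps d_in -> d_sae; decoder maps d_sae -> d_in.
        self.W_enc = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_sae)]
        self.b_enc = [0.0] * d_sae
        self.W_dec = [[rng.gauss(0, 0.1) for _ in range(d_sae)] for _ in range(d_in)]
        self.b_dec = [0.0] * d_in

    def encode(self, x):
        # Feature activations: relu(W_enc @ x + b_enc), so never negative.
        pre = matvec(self.W_enc, x)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, acts):
        # Reconstruction: W_dec @ acts + b_dec.
        return [r + b for r, b in zip(matvec(self.W_dec, acts), self.b_dec)]

rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(8)]   # stand-in for one residual-stream vector
sae = TinySAE(d_in=8, d_sae=32)
acts = sae.encode(x)
x_hat = sae.decode(acts)
print(len(acts), len(x_hat))
```

Each entry of acts corresponds to one feature; a page like this one describes a single such feature.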

Negative Logits

Token        Logit
"chuk"       -0.69
"sections"   -0.68
"Bus"        -0.68
" Ukrain"    -0.67
" Stru"      -0.67
"rums"       -0.66
"insky"      -0.65
"atters"     -0.61
"sels"       -0.61
"ams"        -0.61

Positive Logits

Token          Logit
" reward"      4.02
" Reward"      2.82
" rewards"     2.50
"Reward"       2.22
" rewarded"    2.00
" rewarding"   1.75
" payoff"      1.68
" Rewards"     1.65
" prize"       1.51
" bounty"      1.50
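
The page does not say how the logit lists are produced. A common convention for SAE dashboards, assumed here, is to project the feature's decoder direction through the model's unembedding and rank tokens by the resulting score. The sketch below uses made-up two-dimensional vectors and a four-token vocabulary purely to show the ranking step; the function names and all numbers are hypothetical and are not the dashboard's values.

```python
def logit_effects(decoder_dir, unembed, vocab):
    """Score every token by the dot product of the feature's decoder
    direction with that token's unembedding vector, highest first."""
    scores = [sum(d * u for d, u in zip(decoder_dir, col)) for col in unembed]
    return sorted(zip(vocab, scores), key=lambda pair: pair[1], reverse=True)

def top_and_bottom(ranked, k):
    """Return the k most positive and the k most negative tokens."""
    return ranked[:k], ranked[-k:][::-1]

# Toy inputs (assumptions for illustration only).
vocab = [" reward", " prize", "chuk", "Bus"]
unembed = [[4.0, 0.5], [1.5, -0.2], [-0.7, 0.3], [-0.6, 1.0]]  # one vector per token
decoder_dir = [1.0, 0.0]

ranked = logit_effects(decoder_dir, unembed, vocab)
top, bottom = top_and_bottom(ranked, k=2)
print(top)
print(bottom)
```

Under this convention, tokens such as " reward" at the top of the positive list are the ones the feature pushes the model towards predicting, while the negative list holds the tokens it pushes against.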

Activations Density 0.007%
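
To give the density figure a sense of scale, the sketch below treats it as the fraction of tokens on which the feature's activation is above zero, and assumes it was measured over the dashboard prompt set (24,576 prompts of 128 tokens each). Neither assumption is confirmed by the page, and the displayed 0.007% is itself rounded, so the resulting count is only approximate.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of token positions where the feature fires (activation > threshold)."""
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)

# Figures taken from the dashboard configuration.
n_prompts = 24_576
tokens_per_prompt = 128
total_tokens = n_prompts * tokens_per_prompt

# 0.007% expressed as a fraction; assumed to refer to the dashboard prompts.
density = 0.007 / 100
approx_firing_tokens = round(total_tokens * density)

print(total_tokens)           # 3145728
print(approx_firing_tokens)   # 220

# The helper applied to a toy activation list: one of four positions fires.
print(activation_density([0.0, 0.0, 1.3, 0.0]))  # 0.25
```

On these assumptions the feature fires on roughly 220 of about 3.1 million tokens, which is in line with a feature tied to a small family of reward-related tokens.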

No Known Activations