Explanations

words related to aggressive language and insults

oai_token-act-pair · gpt-3.5-turbo

Configuration

neuronpedia/gpt2-small__res_scl-ajt/6-res_scl-ajt

Prompts (Dashboard)

12,288 prompts, 128 tokens each

Dataset (Dashboard)

Skylion007/openwebtext

Features

46,080

Data Type

torch.float32

Hook Point

blocks.6.hook_resid_pre

Architecture

standard

Context Size

128

Dataset

apollo-research/Skylion007-openwebtext-tokenizer-gpt2

Hook Point Layer

6

Activation Function

relu
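The configuration above describes a standard-architecture sparse autoencoder with a ReLU activation, reading from `blocks.6.hook_resid_pre` and producing 46,080 features. As a rough sketch of what such an encoder step computes, assuming the common form `relu(x @ W_enc + b_enc)` (the exact bias handling of this particular SAE is not stated on the page), the example below uses toy sizes in place of the real dimensions. The names `encode`, `W_enc`, and `b_enc` are illustrative and not taken from the source.

```python
import random

D_MODEL = 768        # residual stream width of GPT-2 small
N_FEATURES = 46_080  # feature count listed in the configuration

# Toy sizes so the sketch runs instantly; the real SAE is far larger.
TOY_D_MODEL = 4
TOY_N_FEATURES = 6


def relu(values):
    """Clamp negative pre-activations to zero."""
    return [max(0.0, v) for v in values]


def encode(x, W_enc, b_enc):
    """Map one residual-stream vector to feature activations.

    x:     list of length d_model
    W_enc: d_model rows, each a list of n_features weights
    b_enc: list of length n_features
    """
    pre_acts = [
        sum(x[i] * W_enc[i][j] for i in range(len(x))) + b_enc[j]
        for j in range(len(b_enc))
    ]
    return relu(pre_acts)


rng = random.Random(0)
W_enc = [[rng.gauss(0, 1) for _ in range(TOY_N_FEATURES)]
         for _ in range(TOY_D_MODEL)]
b_enc = [0.0] * TOY_N_FEATURES
x = [rng.gauss(0, 1) for _ in range(TOY_D_MODEL)]

features = encode(x, W_enc, b_enc)
expansion_factor = N_FEATURES // D_MODEL  # dictionary size relative to model width
```

With the listed sizes, the dictionary is 60 times wider than the residual stream it reads from, which is why only a small fraction of features is expected to be active on any one token.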

Negative Logits

DragonMagazine  -0.70
 Annotations    -0.61
ERC             -0.58
anamo           -0.56
chall           -0.55
LOCK            -0.53
izu             -0.53
District        -0.53
immune          -0.52
Rank            -0.51

Positive Logits

uity            0.68
iquid           0.67
ented           0.64
creen           0.63
ivery           0.63
eties           0.63
uate            0.63
arious          0.63
ocations        0.61
ength           0.61
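Negative and positive logit lists like the ones above are commonly produced by projecting a feature's decoder direction through the model's unembedding matrix and ranking the vocabulary by the resulting scores. The page does not say how its values were computed, so the sketch below is an assumption about the general technique, with toy numbers and placeholder token names. The helpers `logit_effects` and `top_and_bottom` are hypothetical.

```python
def logit_effects(decoder_dir, W_U):
    """Project a feature's decoder direction onto every vocabulary token.

    decoder_dir: list of length d_model
    W_U:         d_model rows, each a list of vocab_size weights
    """
    vocab_size = len(W_U[0])
    return [
        sum(decoder_dir[i] * W_U[i][t] for i in range(len(decoder_dir)))
        for t in range(vocab_size)
    ]


def top_and_bottom(effects, tokens, k):
    """Return the k most negative and k most positive (token, score) pairs."""
    ranked = sorted(zip(tokens, effects), key=lambda pair: pair[1])
    most_negative = ranked[:k]
    most_positive = ranked[::-1][:k]
    return most_negative, most_positive


# Toy example: 2-dimensional model, 4-token vocabulary (all values made up).
tokens = ["tok_a", "tok_b", "tok_c", "tok_d"]
decoder_dir = [1.0, -1.0]
W_U = [
    [0.5, 0.1, -0.2, 0.0],
    [-0.1, 0.2, 0.3, 0.0],
]

effects = logit_effects(decoder_dir, W_U)
negative, positive = top_and_bottom(effects, tokens, k=2)
```

Applied to the real model, the same ranking over the full GPT-2 vocabulary with `k=10` would give two ten-entry lists in the shape shown on this page.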

Activations Density 5.105%
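To put the density figure in context, the short calculation below estimates how many tokens it corresponds to. It assumes that density means the fraction of dashboard tokens on which the feature has a nonzero activation, measured over the 12,288 prompts of 128 tokens listed in the configuration; the page does not define the term, and the helper `density` is illustrative.

```python
PROMPTS = 12_288          # dashboard prompts listed in the configuration
TOKENS_PER_PROMPT = 128   # tokens per prompt (context size)
DENSITY = 0.05105         # 5.105% as a fraction


def density(activations):
    """Fraction of token positions with a nonzero activation."""
    if not activations:
        return 0.0
    return sum(1 for a in activations if a > 0) / len(activations)


total_tokens = PROMPTS * TOKENS_PER_PROMPT
active_tokens = round(total_tokens * DENSITY)

# Tiny made-up sequence: 1 of 4 positions is active.
toy_density = density([0.0, 0.0, 1.7, 0.0])
```

Under that assumption, the feature would fire on roughly 80,000 of about 1.57 million tokens.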

No Known Activations