Explanations

negative and aggressive language, including death threats and hate-filled messages

oai_token-act-pair · gpt-3.5-turbo · Triggered by @bot

Configuration

jbloom/Gemma-2b-IT-Residual-Stream-SAEs/gemma_2b_it_blocks.12.hook_resid_post_16384

Prompts (Dashboard)

24,576 prompts, 128 tokens each

Dataset (Dashboard)

HuggingFaceFW/fineweb

Features

16,384

Data Type

float32

Hook Name

blocks.12.hook_resid_post

Hook Layer

12

Architecture

standard

Context Size

1,024

Dataset

Skylion007/openwebtext

Activation Function

relu
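The configuration above names a "standard" architecture with a relu activation reading from blocks.12.hook_resid_post. A minimal sketch of what that usually means is shown here, assuming the common formulation f = ReLU((x - b_dec) W_enc + b_enc) with a linear decoder. The function names `sae_encode` and `sae_decode` and the toy weights are hypothetical illustrations, not the released SAE's code or parameters. The real dictionary has 16,384 features; the toy sizes only keep the example readable.

```python
def sae_encode(x, W_enc, b_enc, b_dec):
    """Feature activations for one residual-stream vector x.

    Assumed formulation: f = ReLU((x - b_dec) @ W_enc + b_enc)
    x, b_dec: length d_model; W_enc: d_model x d_sae; b_enc: length d_sae.
    """
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre_acts = [
        sum(centered[i] * W_enc[i][j] for i in range(len(centered))) + b_enc[j]
        for j in range(len(b_enc))
    ]
    # ReLU: negative pre-activations are clamped to zero, so most features stay off.
    return [max(0.0, p) for p in pre_acts]


def sae_decode(f, W_dec, b_dec):
    """Reconstruct the residual-stream vector: x_hat = f @ W_dec + b_dec.

    f: length d_sae; W_dec: d_sae x d_model; b_dec: length d_model.
    """
    return [
        sum(f[j] * W_dec[j][i] for j in range(len(f))) + b_dec[i]
        for i in range(len(b_dec))
    ]


# Toy example: d_model = 2, d_sae = 3 (hypothetical weights).
x = [1.0, -2.0]
W_enc = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
b_enc = [0.0, 0.0, 0.5]
b_dec = [0.0, 0.0]
W_dec = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

features = sae_encode(x, W_enc, b_enc, b_dec)
x_hat = sae_decode(features, W_dec, b_dec)
```

Only the first toy feature fires here, which mirrors how a single dashboard feature is active on a small fraction of tokens.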


Negative Logits

" shenan"          -1.03
" hairc"           -1.02
" juges"           -1.01
" ecru"            -1.01
" négociations"    -0.98
"<bos>"            -0.95
" plais"           -0.94
" récompenses"     -0.93
" réunions"        -0.93
" vœux"            -0.93

Positive Logits

" unexpected"      0.61
" discussions"     0.61
" occasional"      0.59
"eclamp"           0.58
" Palembang"       0.57
"wareness"         0.57
"intenance"        0.56
" caña"            0.56
" heridos"         0.56
" prayers"         0.55
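Logit lists like the ones above are commonly produced by projecting a feature's decoder direction through the model's unembedding matrix and ranking tokens by the resulting score. The sketch below assumes that procedure. Whether this dashboard applies extra steps (for example, folding in the final layer norm) is not stated in the source, and `logit_effects`, `top_and_bottom`, and the toy matrices are hypothetical.

```python
def logit_effects(decoder_dir, W_U, vocab):
    """Score each vocabulary token by decoder_dir @ W_U.

    decoder_dir: length d_model (one feature's decoder row).
    W_U: d_model x n_vocab unembedding matrix.
    vocab: list of n_vocab token strings.
    """
    scores = [
        sum(decoder_dir[i] * W_U[i][t] for i in range(len(decoder_dir)))
        for t in range(len(vocab))
    ]
    return dict(zip(vocab, scores))


def top_and_bottom(effects, k=10):
    """Return (k most negative, k most positive) tokens, each most extreme first."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1])
    return ranked[:k], ranked[-k:][::-1]


# Toy example with a 3-token vocabulary (hypothetical values).
vocab = [" prayers", "<bos>", " the"]
decoder_dir = [1.0, 0.0]
W_U = [[0.5, -1.0, 0.0], [9.0, 9.0, 9.0]]

effects = logit_effects(decoder_dir, W_U, vocab)
negative, positive = top_and_bottom(effects, k=1)
```

With the toy values, the second row of `W_U` has no effect because the decoder direction is zero along that dimension.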

Activations Density 0.566%
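To put the density figure in context, the following sketch assumes that density means the fraction of tokens on which the feature's activation is greater than zero, and that it was measured over the dashboard prompt set listed above (24,576 prompts of 128 tokens). Both are assumptions, since the source does not define the metric, and `activation_density` is a hypothetical helper.

```python
N_PROMPTS = 24_576
TOKENS_PER_PROMPT = 128
DENSITY = 0.566 / 100  # 0.566% as a fraction

# Total tokens in the dashboard prompt set.
total_tokens = N_PROMPTS * TOKENS_PER_PROMPT

# Approximate number of tokens on which the feature would be active.
approx_active_tokens = round(total_tokens * DENSITY)


def activation_density(activations):
    """Fraction of tokens with a strictly positive feature activation."""
    if not activations:
        return 0.0
    return sum(1 for a in activations if a > 0) / len(activations)
```

Under these assumptions the feature would fire on roughly 17,800 of about 3.1 million tokens.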

No Known Activations