Explanations

unintended negative consequences

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

Token              Logit
" healthful"       0.42
" analysis"        0.41
" Relatively"      0.41
"젹"               0.41
" parton"          0.40
" inscribed"       0.40
" observant"       0.39
" strengthened"    0.39
"dT"               0.39
" adaption"        0.38

Positive Logits

Token              Logit
" frustrating"     0.91
" annoying"        0.89
" intermin"        0.77
" infuri"          0.76
"😩"               0.76
" frustrations"    0.74
" miserably"       0.73
" annoy"           0.73
" irritating"      0.73
" frust"           0.72

Activations Density: 0.116%