INDEX

Explanations

failure modes and risks

Configuration

Prompts (Dashboard): 24,576 prompts, 128 tokens each

Dataset (Dashboard): monology/pile-uncopyrighted

Negative Logits

Token             Logit
" fostered"       -1.12
"answered"        -1.06
"白色的"           -1.05
" Pager"          -1.05
"όν"              -1.04
"aben"            -1.02
" lentamente"     -1.02
"didSet"          -0.98
" делает"         -0.98
" forcefully"     -0.98

Positive Logits

Token             Logit
"of"              1.62
" Failure"        1.47
" modes"          1.18
"to"              1.16
" Trying"         1.09
" Businesses"     1.07
" mode"           1.04
"modes"           1.03
"failure"         1.02
"caba"            1.02

Activations Density: 0.024%
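
As a rough sanity check on the figures above, the dashboard configuration implies a total token count, and the activation density implies approximately how many of those tokens the feature fires on. The sketch below assumes the density is measured over the same dashboard prompts and that every prompt is exactly 128 tokens long; the page states neither, and the variable names are illustrative only.

```python
# Figures taken from the dashboard configuration above.
NUM_PROMPTS = 24_576
TOKENS_PER_PROMPT = 128
ACTIVATION_DENSITY_PERCENT = 0.024

# Total tokens, assuming every prompt is exactly 128 tokens long.
total_tokens = NUM_PROMPTS * TOKENS_PER_PROMPT

# Estimated number of tokens with a nonzero activation, assuming the
# density was computed over these same prompts (not stated on the page).
estimated_active_tokens = total_tokens * ACTIVATION_DENSITY_PERCENT / 100

print(f"total tokens: {total_tokens:,}")
print(f"estimated active tokens: {estimated_active_tokens:.0f}")
```

Under these assumptions the feature is sparse: it fires on fewer than one token in four thousand.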