INDEX

Explanations

protect from harm

Configuration

Prompts (Dashboard): 24,576 prompts, 128 tokens each

Dataset (Dashboard): monology/pile-uncopyrighted

Negative Logits (token, logit; quotes mark leading spaces)

"すぎて"  -1.04
"ったので"  -1.01
"ijnlijk"  -0.96
" geringer"  -0.96
"庭園"  -0.94
"几个人"  -0.92
" winners"  -0.92
" varones"  -0.91
"rror"  -0.90
"と言って"  -0.88

Positive Logits (token, logit; quotes mark leading spaces)

" being"  2.08
" becoming"  1.91
"any"  1.45
"becoming"  1.37
"来自"  1.37
" harm"  1.30
"Becoming"  1.25
" menjadi"  1.24
" having"  1.23
"ització"  1.23

Activations Density 0.057%
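
To put the density figure in context, here is a minimal arithmetic sketch in Python. It assumes that "Activations Density" is the fraction of sampled tokens on which the feature has a nonzero activation, and that it was measured over the same 24,576 prompts of 128 tokens listed under Configuration. The page confirms neither assumption, and the function name is illustrative only.

```python
# Figures copied from the dashboard; how they relate to each other is an assumption.
NUM_PROMPTS = 24_576
TOKENS_PER_PROMPT = 128
ACTIVATION_DENSITY_PERCENT = 0.057


def estimate_active_tokens(num_prompts: int, tokens_per_prompt: int,
                           density_percent: float) -> tuple[int, float]:
    """Return (total sampled tokens, estimated tokens with nonzero activation).

    Assumes density is a per-token firing rate over the whole prompt set.
    """
    total = num_prompts * tokens_per_prompt
    return total, total * density_percent / 100.0


total_tokens, active_tokens = estimate_active_tokens(
    NUM_PROMPTS, TOKENS_PER_PROMPT, ACTIVATION_DENSITY_PERCENT
)
print(f"{total_tokens:,} tokens sampled")
print(f"about {round(active_tokens):,} tokens with nonzero activation")
```

Under these assumptions the feature fires on roughly 1,800 of about 3.1 million sampled tokens, which fits a sparse feature tied to a narrow phrase such as "protect from harm".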