Explanations

requests for inappropriate or harmful content.

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

" загру"  0.43
" loads"  0.42
" overloaded"  0.42
" riches"  0.42
" impatient"  0.41
"Nodes"  0.41
" отлич"  0.40
"！"  0.40
" overloading"  0.40
"流"  0.40

Positive Logits

" harmless"  0.84
" legitt"  0.79
" legitimate"  0.73
" legít"  0.73
" permissible"  0.71
" lawful"  0.71
"あくまで"  0.68
" lawfully"  0.67
" innocuous"  0.67
" respectful"  0.64

Activations Density: 1.771%