Explanations

My Safety Guidelines

Phrases in which the assistant refuses a request and cites its safety guidelines, or makes a similar refusal or safety-policy statement.

Configuration

Prompts (Dashboard)

238,145 prompts, 512 tokens each

Dataset (Dashboard)

lmsys + oasst1

Negative Logits

"typeof"        0.39
"affected"      0.38
"unnumbered"    0.38
"rawn"          0.37
"rights"        0.37
" values"       0.37
"right"         0.36
"tab"           0.36
"param"         0.35
"oms"           0.35

Positive Logits

" konfigur"     0.51
"숴"            0.43
" heresy"       0.43
" legions"      0.43
"ロボ"          0.42
" installé"     0.42
" demonic"      0.41
" Datenbank"    0.40
" sistem"       0.40
"犟"            0.40

Activations Density 0.422%