Explanations

refusal to generate harmful content

Configuration

Prompts (Dashboard)

238,145 prompts, 512 tokens each

Dataset (Dashboard)

lmsys + oasst1

Negative Logits

(missing)  1.56
st         1.48
к          1.41
se         1.29
cc         1.29
ة          1.27
지         1.20
der        1.17

Positive Logits

 restitu      1.05
 aplikace     0.95
 interd       0.95
 previstas    0.95
 geograf      0.94
এত            0.93
 transforma   0.93
 selecionar   0.92
 metody       0.92
 banderas     0.92

Activations Density 0.000%