Explanations

responses that are prohibited

sentences where the model declines requests and invokes safety, moderation, or refusal language about sexually explicit, harmful, or disallowed content.

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

"無い"            0.28
"pple"            0.28
"gå"              0.25
" possibilité"    0.25
"atedral"         0.25
"ppled"           0.25
" tada"           0.24
" principle"      0.23
"੩"               0.23
" epistle"        0.23

Positive Logits

" уйнагыз"        0.31
"for"             0.30
" щодо"           0.30
" regarding"      0.29
"and"             0.29
"or"              0.27
" для"            0.27
" dotyczące"      0.27
" with"           0.26
" zarówno"        0.26

Activations Density: 0.375%
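
For readers unfamiliar with the density figure, the sketch below shows one common way such a number can be computed: the fraction of token positions on which a feature's activation exceeds a threshold. This is an illustration only. The function name `activation_density`, the zero threshold, and the toy data are all assumptions, and the page does not state how its own figure was derived.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions with activation > threshold.

    `activations` is a list of prompts, each a list of per-token
    activation values for a single feature (hypothetical layout).
    """
    total = 0
    active = 0
    for prompt in activations:
        for value in prompt:
            total += 1
            if value > threshold:
                active += 1
    # Guard against an empty input rather than dividing by zero.
    return active / total if total else 0.0


# Toy data (not from the dashboard): 2 prompts x 4 tokens, one active token.
toy = [
    [0.0, 0.0, 1.2, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
density = activation_density(toy)
print(f"{density:.3%}")  # prints 12.500%
```

Under this reading, a density of 0.375% would mean the feature is active on roughly 1 in 267 token positions.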