Explanations

acknowledging and refusing harmful requests

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

 ليست  0.46
şe  0.45
 മിക  0.45
şa  0.44
ކ  0.43
asserie  0.42
ický  0.42
 lakini  0.42
indi  0.41
agh  0.41

Positive Logits

 datos  0.46
 prohib  0.45
 extractor  0.44
 extraction  0.43
 Beverly  0.43
 interdit  0.43
 запре  0.43
덧  0.42
 separators  0.42
 recherches  0.42

Activations Density 0.010%