Explanations

I'll explain or give details

Finds assistant safety/policy language — refusals, disclaimers, and explanations about prohibited content or why the model can't comply.

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

rini  0.37
lobulin  0.37
ilor  0.36
 hyperbol  0.35
reda  0.35
rial  0.35
ůli  0.35
 sân  0.34
inio  0.34
une  0.34

Positive Logits

 supplement  0.42
 supplements  0.39
supplement  0.38
KEY  0.36
مکمل  0.36
 supplemental  0.36
 clés  0.36
 Suppl  0.35
 टास्क  0.35
 Infectious  0.35

Activations Density 13.979%