INDEX

Explanations

understanding user difficulties

Detects dangerous or illicit requests, the model's safety refusals, and crisis or help-seeking language (e.g., offers of resources and warnings).

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

" optimizing"      0.39
"etchup"           0.37
" Critics"         0.37
" shrimps"         0.37
" martini"         0.37
" topologically"   0.36
" preserving"      0.35
"Critics"          0.35
"ographers"        0.35
" Shaping"         0.35

Positive Logits

"寻求"               0.52
" urges"            0.50
" желание"          0.49
" vragen"           0.49
" motiva"           0.47
" keinginan"        0.47
" möglicherweise"   0.46
" kebutuhan"        0.45
" Bedür"            0.45
" मनात"             0.45

Activations Density 0.353%
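
As a rough illustration of the scale behind the density figure, the sketch below converts it into an approximate count of token positions. It rests on two assumptions the page does not confirm: that density is the fraction of token positions at which the feature is active, and that it is measured over the dashboard prompt set listed under Configuration. The function and constant names are hypothetical.

```python
# Assumption: density = active token positions / total token positions,
# measured over the dashboard prompt set. The page does not state this.
NUM_PROMPTS = 238_145        # from "Prompts (Dashboard)"
TOKENS_PER_PROMPT = 512      # from "Prompts (Dashboard)"
DENSITY_PERCENT = 0.353      # from "Activations Density"


def total_token_positions(num_prompts: int, tokens_per_prompt: int) -> int:
    """Total number of token positions in the prompt set."""
    return num_prompts * tokens_per_prompt


def estimated_active_positions(total: int, density_percent: float) -> float:
    """Estimated number of positions where the feature is active."""
    return total * density_percent / 100.0


total = total_token_positions(NUM_PROMPTS, TOKENS_PER_PROMPT)
active = estimated_active_positions(total, DENSITY_PERCENT)

print(f"total token positions: {total:,}")
print(f"estimated active positions: {active:,.0f}")
```

Under these assumptions the feature would be active on only a few hundred thousand of roughly 122 million token positions, which fits a sparse feature.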