INDEX

Explanations

acknowledging limitations or inability

role markers that denote the start of the assistant/model’s turn in structured chat transcripts.

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each
Dataset (Dashboard): lmsys + oasst1

Negative Logits

ście    0.40
Salt    0.39
ㄤ    0.37
ätte    0.37
驿    0.36
[&    0.36
 Happens    0.35
(blank)    0.35
muş    0.35
施    0.34

Positive Logits

 regrett    0.50
 regret    0.45
 unfortunately    0.45
 sorry    0.43
 عدم    0.43
 regretted    0.42
 unable    0.41
 inability    0.41
 unavailable    0.41
 scandalous    0.40

Activations Density 0.031%
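
To give the density figure a sense of scale, the sketch below estimates how many token positions the feature fires on. It assumes the 0.031% density is the fraction of token positions in the dashboard prompt set (238,145 prompts of 512 tokens each) with a nonzero activation; the page does not state this, and the variable names are illustrative only.

```python
# Rough scale estimate for the activation density reported above.
# Assumption: density is measured over all token positions in the
# dashboard prompt set. The page does not confirm this.

num_prompts = 238_145        # from "Prompts (Dashboard)"
tokens_per_prompt = 512      # from "Prompts (Dashboard)"
density_percent = 0.031      # from "Activations Density"

# Total token positions in the prompt set.
total_tokens = num_prompts * tokens_per_prompt

# Estimated number of positions where the feature is active.
estimated_active_tokens = total_tokens * density_percent / 100

print(f"total tokens: {total_tokens:,}")
print(f"estimated active tokens: {estimated_active_tokens:,.0f}")
```

Under that assumption the feature is active on roughly 38 thousand of about 122 million token positions, which fits a feature tied to a narrow behaviour such as apologies or statements of inability.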