INDEX

Explanations

Sentences or phrases in which the model issues a refusal based on safety rules (e.g., "I cannot", "I absolutely cannot and will not", and similar safety-guideline statements).

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

Token (quoted to show leading spaces), value:

" occasional"  0.40
"High"         0.40
" Schol"       0.40
"Long"         0.39
" High"        0.39
" gradually"   0.38
" closer"      0.37
" Long"        0.37

Positive Logits

Token (quoted to show leading spaces), value:

"诋"            0.67
"詆"            0.67
" unfairly"    0.60
" disrespect"  0.57
" punitive"    0.57
" punish"      0.55
" প্রতিহিংস"    0.54
" जबर"         0.54
" наказа"      0.52
" dehuman"     0.52

Activations Density 0.486%
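
To give a rough sense of scale for the density figure, the sketch below converts it into an approximate token count. It assumes, without confirmation from this page, that "Activations Density" means the fraction of dashboard tokens on which the feature has a nonzero activation, and that every prompt contributes exactly 512 tokens. The function name `estimated_active_tokens` is hypothetical and used only for illustration.

```python
# Rough scale estimate for the activation density reported above.
# ASSUMPTION: density = fraction of dashboard tokens with nonzero activation.
# ASSUMPTION: every prompt contributes exactly 512 tokens.

PROMPTS = 238_145          # from "Prompts (Dashboard)"
TOKENS_PER_PROMPT = 512    # from "Prompts (Dashboard)"
DENSITY_PERCENT = 0.486    # from "Activations Density"


def estimated_active_tokens(prompts: int, tokens_per_prompt: int,
                            density_percent: float) -> tuple[int, int]:
    """Return (total tokens, estimated tokens on which the feature fires)."""
    total = prompts * tokens_per_prompt
    active = round(total * density_percent / 100)
    return total, active


total_tokens, active_tokens = estimated_active_tokens(
    PROMPTS, TOKENS_PER_PROMPT, DENSITY_PERCENT
)
print(f"total tokens: {total_tokens:,}")
print(f"estimated active tokens: {active_tokens:,}")
```

Under those assumptions the feature would fire on roughly six hundred thousand of about 122 million tokens, which is consistent with a sparse feature tied to a narrow behavior such as refusals.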