INDEX

Explanations

prohibiting harmful responses

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

Token (quoted to show leading spaces)  Value
"挀"  0.42
" Ears"  0.40
"Composer"  0.40
"顐"  0.40
"滖"  0.39
"ostino"  0.39
" composer"  0.37
" حصول"  0.37
"使用了"  0.37
"🍚"  0.36

Positive Logits

Token (quoted to show leading spaces)  Value
" dangerous"  0.69
" hate"  0.65
"Dangerous"  0.61
" hazardous"  0.59
" опас"  0.57
" dangereux"  0.57
" hates"  0.55
" dangere"  0.54
" gefähr"  0.54
" Hazardous"  0.54

Activations Density 0.138%
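
To give the density figure a sense of scale, the sketch below converts it into an approximate token count. It assumes the density is the fraction of tokens with nonzero activation, measured over the dashboard prompts listed under Configuration; the page does not confirm this. The function name estimate_active_tokens is hypothetical and used only for illustration.

```python
def estimate_active_tokens(num_prompts, tokens_per_prompt, density_percent):
    """Return (total tokens, estimated tokens on which the feature is active).

    Assumption: density_percent is the share of all dashboard tokens
    that have a nonzero activation for this feature.
    """
    total_tokens = num_prompts * tokens_per_prompt
    active_tokens = total_tokens * density_percent / 100.0
    return total_tokens, active_tokens


# Figures taken from the dashboard: 238,145 prompts of 512 tokens, density 0.138%.
total, active = estimate_active_tokens(238_145, 512, 0.138)
print(f"{total:,} tokens in total, roughly {active:,.0f} with nonzero activation")
```

Under that assumption the feature is sparse: it fires on about one token in every 725.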