INDEX

Explanations

using ChatGPT

Detects embedded instructions or role-change/jailbreak prompts that try to override the model's normal role or behavior.

Configuration

Prompts (Dashboard)

238,145 prompts, 512 tokens each

Dataset (Dashboard)

lmsys + oasst1

Negative Logits

 Киє    0.36
 टीस्पून    0.36
❥    0.36
\{\    0.35
新增    0.35
 استخدم    0.35
ちなみに    0.35
<0xE1>    0.34
 '../../    0.34
CONCLUS    0.34

Positive Logits

 nuanced    0.45
 reimag    0.44
 plaus    0.44
 deleterious    0.44
 wondrous    0.43
 fantastical    0.43
 creatives    0.42
 conceptions    0.42
 often    0.41
 neuroscience    0.41

Activations Density 0.009%
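
To give the density figure a sense of scale, the sketch below combines it with the prompt counts listed under Configuration. It assumes two things the page does not state: that "Activations Density" means the fraction of token positions on which the feature has a non-zero activation, and that it was measured over the full prompt set shown above. The function names are illustrative only.

```python
# Values copied from the dashboard text.
NUM_PROMPTS = 238_145        # "238,145 prompts"
TOKENS_PER_PROMPT = 512      # "512 tokens each"
DENSITY_PERCENT = 0.009      # "Activations Density 0.009%"


def total_tokens(num_prompts: int, tokens_per_prompt: int) -> int:
    """Total token positions in the prompt set."""
    return num_prompts * tokens_per_prompt


def estimated_active_tokens(total: int, density_percent: float) -> float:
    """Token positions with non-zero activation, ASSUMING density is the
    fraction of positions where the feature fires (not confirmed by the page)."""
    return total * density_percent / 100


total = total_tokens(NUM_PROMPTS, TOKENS_PER_PROMPT)
active = estimated_active_tokens(total, DENSITY_PERCENT)

print(f"total token positions: {total:,}")
print(f"estimated active positions: {active:,.0f}")
```

Under those assumptions the feature would fire on roughly 11,000 of about 122 million token positions. Because the displayed density is rounded to one significant figure, this is an order-of-magnitude estimate only.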