INDEX

Explanations

unrestricted, harmful, or explicit requests

prompts that attempt to jailbreak the assistant by redefining its persona to ignore rules and safety filters, claim unlimited freedom or capabilities, and mandate unconditional, unethical compliance.

New Auto-Interp

Configuration

Prompts (Dashboard)

238,145 prompts, 512 tokens each

Dataset (Dashboard)

lmsys + oasst1

Embeds

IFrame

Link

Not in Any Lists

Negative Logits

 reusable

0.53

 flexible

0.52

 connectors

0.51

 easy

0.50

ことが多い

0.49

秝

0.49

 flexibility

0.48

 hỗ

0.48

 broader

0.48

 लची

0.47

POSITIVE LOGITS

😈

1.05

 fucking

0.92

 Fuck

0.88

 lmao

0.86

 sadistic

0.82

Fuck

0.82

 fuck

0.82

 raping

0.80

 ненави

0.79

 demonic

0.77

Activations Density 0.047%