INDEX

Explanations

approaches, phrasing, or options

meta-instructions about the AI’s role and behavior—especially jailbreak-style prompts and safety/policy persona language referring to ChatGPT and how it should respond.

Configuration

Prompts (Dashboard)

238,145 prompts, 512 tokens each

Dataset (Dashboard)

lmsys + oasst1

Negative Logits

Token            Logit
" gaman"         0.54
" Kala"          0.45
" pestic"        0.44
" semis"         0.44
" dimensione"    0.42
" உலகம்"          0.41
"kala"           0.41
" SaaS"          0.41
" sustancias"    0.41
" Nicholls"      0.41

Positive Logits

Token            Logit
"ered"           0.43
"illerato"       0.42
" Sensory"       0.40
(blank or whitespace token)  0.39
"ban"            0.38
" развитию"      0.37
"㢄"             0.37
"amaged"         0.37
"าร"             0.37
"event"          0.37

Activations Density 16.574%