Explanations
approaches, phrasing, or options
meta-instructions about the AI’s role and behavior—especially jailbreak-style prompts and safety/policy persona language referring to ChatGPT and how it should respond.
Negative Logits
gaman        0.54
Kala         0.45
pestic       0.44
semis        0.44
dimensione   0.42
உலகம்        0.41
kala         0.41
SaaS         0.41
sustancias   0.41
Nicholls     0.41
Positive Logits
ered         0.43
illerato     0.42
Sensory      0.40
[blank]      0.39
ban          0.38
развитию     0.37
㢄           0.37
amaged       0.37
าร           0.37
event        0.37
Activations Density 16.574%