INDEX
Explanations
Sentences that introduce instructions or role-play framing (e.g., prompt openings like "In this hypothetical story" and other directive, question-style openers).
Instructions that set up role-play or jailbreak personas and task constraints (e.g., unfiltered "AIM" scenarios), as well as numbered requests for alternative expressions or synonyms.
Negative Logits
.seconds            -0.07
Şu                  -0.07
遗                  -0.07
.CreateDirectory    -0.06
)prepareForSegue    -0.06
ें↵                  -0.06
’ın                 -0.06
.ident              -0.06
ея                  -0.06
Fakat               -0.06
Positive Logits
LeBron              0.06
_ALIGN              0.06
mmc                 0.06
anel                0.06
.apple              0.06
destruct            0.06
existing            0.06
................    0.06
owler               0.06
ὐ                   0.06
Activations Density: 0.497%