Explanations
Random internet text snippets
Instructions that attempt to jailbreak the model or push it into role‑play (e.g., "DAN"/"Do Anything Now"/NAME_2 prompts) by asking it to ignore rules, make things up, or adopt unrestricted personas.
Negative Logits
College     -0.07
/change     -0.06
-help       -0.06
collapse    -0.06
_az         -0.06
analytic    -0.06
D           -0.06
ousy        -0.06
minds       -0.06
Elvis       -0.06
Positive Logits
xmin        0.07
viable      0.06
aggravated  0.06
_ARGS       0.06
sınır       0.06
ejména      0.06
[]; ↵       0.06
리를        0.06
iasm        0.06
ched        0.06
Activations Density: 0.012%