INDEX
Explanations
dialogue
prompts and chat transitions that set up a jailbreak roleplay, especially instructions to adopt an “evil, no-ethics” persona and produce harmful responses.
Negative Logits
pression       -0.07
쓰             -0.07
mensaje        -0.07
Minh           -0.07
Mono           -0.06
Мініст         -0.06
sogar          -0.06
міст           -0.06
führ           -0.06
suppression    -0.06
Positive Logits
hesitant       0.06
Drop           0.06
fscanf         0.06
policing       0.06
.iloc          0.06
waiting        0.06
↵ ↵            0.06
Kurul          0.06
мал            0.05
indexes        0.05
Activations Density: 0.050%