INDEX
Explanations
unrestricted, harmful, or explicit requests
prompts that attempt to jailbreak the assistant by redefining its persona to ignore rules and safety filters, claim unlimited freedom or capabilities, and mandate unconditional, unethical compliance.
New Auto-Interp
Negative Logits
reusable
0.53
flexible
0.52
connectors
0.51
easy
0.50
ことが多い
0.49
秝
0.49
flexibility
0.48
hỗ
0.48
broader
0.48
लची
0.47
POSITIVE LOGITS
😈
1.05
fucking
0.92
Fuck
0.88
lmao
0.86
sadistic
0.82
Fuck
0.82
fuck
0.82
raping
0.80
ненави
0.79
demonic
0.77
Activations Density 0.047%