INDEX
Explanations
questions and answers
instructions attempting to jailbreak or bypass safety (roleplay/DAN-like prompts that tell the model to ignore rules and produce disallowed content).
requests to generate specific text content in a stated format or genre—often explicit or illicit—such as stories, songs, or emails.
Negative Logits
Token      Logit
Handlers   -0.07
.Player    -0.07
мног       -0.07
afternoon  -0.07
AREST      -0.07
/my        -0.07
(cache     -0.06
OVER       -0.06
PIP        -0.06
overy      -0.06
Positive Logits
Token      Logit
}`;↵↵      0.06
cogn       0.06
izabeth    0.06
emoc       0.06
奥         0.06
mạng       0.06
الس        0.06
Greek      0.06
�          0.06
-sort      0.06
Activations Density: 0.097%