INDEX
Explanations
Questions and answers
the neuron detects passages that attempt to change the assistant's role or inject system-level/persona instructions (prompt‑injection/jailbreak style text).
requests that try to jailbreak the model by asking it to act as “DAN” with dual [CLASSIC]/[JAILBREAK] responses and ignore normal policies.
New Auto-Interp
Negative Logits
itm
-0.07
hů
-0.07
town
-0.06
ーバ
-0.06
volts
-0.06
iding
-0.06
Grass
-0.06
otáz
-0.06
sid
-0.06
tones
-0.06
POSITIVE LOGITS
�
0.07
Affordable
0.07
+',
0.07
/audio
0.07
watchdog
0.07
//----------------
0.06
_Function
0.06
관련
0.06
allenge
0.06
sorting
0.06
Activations Density 0.004%