INDEX
Explanations
programming/reviews/general text
This neuron detects tokens from a "jailbreak" or “DAN” style instruction prompt that frees the AI from normal policy constraints.
New Auto-Interp
Negative Logits
联系
-0.07
кноп
-0.07
IMPORT
-0.06
객체
-0.06
Murdoch
-0.06
schö
-0.06
认识
-0.06
fingerprint
-0.06
objetos
-0.06
覚
-0.06
POSITIVE LOGITS
useppe
0.07
tri
0.06
osals
0.06
persever
0.06
-win
0.06
.Ap
0.06
scanned
0.06
vere
0.06
')</
0.06
odoxy
0.06
Activations Density 0.004%