INDEX
    Explanations

    Questions and answers

    the neuron detects passages that attempt to change the assistant's role or inject system-level/persona instructions (prompt‑injection/jailbreak style text).

    requests that try to jailbreak the model by asking it to act as “DAN” with dual [CLASSIC]/[JAILBREAK] responses and ignore normal policies.

    New Auto-Interp
    Negative Logits
     itm
    -0.07
    -0.07
    town
    -0.06
    ーバ
    -0.06
     volts
    -0.06
    iding
    -0.06
     Grass
    -0.06
     otáz
    -0.06
     sid
    -0.06
    tones
    -0.06
    POSITIVE LOGITS
    0.07
     Affordable
    0.07
    +',
    0.07
    /audio
    0.07
     watchdog
    0.07
    //----------------
    0.06
    _Function
    0.06
    관련
    0.06
    allenge
    0.06
    sorting
    0.06
    Act Density 0.004%

    No Known Activations