    Explanations

    programming/reviews/general text

    This neuron detects tokens from a "jailbreak" or "DAN"-style instruction prompt, one that purports to free the AI from its normal policy constraints.

    Negative Logits
    Token             Logit
    "联系"            -0.07
    " кноп"           -0.07
    " IMPORT"         -0.06
    " 객체"           -0.06
    " Murdoch"        -0.06
    " schö"           -0.06
    "认识"            -0.06
    " fingerprint"    -0.06
    " objetos"        -0.06
                      -0.06

    Positive Logits
    Token             Logit
    "useppe"          0.07
    "tri"             0.06
    "osals"           0.06
    " persever"       0.06
    "-win"            0.06
    ".Ap"             0.06
    " scanned"        0.06
    "vere"            0.06
    "')</"            0.06
    "odoxy"           0.06

    Activation Density: 0.004%

    No Known Activations