INDEX
    Explanations

    Prompts and chat transitions that set up a jailbreak roleplay, especially instructions to adopt an “evil, no-ethics” persona and produce harmful responses.

    Negative Logits
    pression      -0.07
                  -0.07
     mensaje      -0.07
     Minh         -0.07
    Mono          -0.06
     Мініст       -0.06
     sogar        -0.06
     міст         -0.06
    führ          -0.06
     suppression  -0.06
    Positive Logits
     hesitant     0.06
     Drop         0.06
     fscanf       0.06
     policing     0.06
    .iloc         0.06
     waiting      0.06
    ↵    ↵        0.06
     Kurul        0.06
     мал          0.05
    indexes       0.05
    Act Density 0.050%

    No Known Activations