INDEX
    Explanations

    approaches, phrasing, or options

    meta-instructions about the AI’s role and behavior—especially jailbreak-style prompts and safety/policy persona language referring to ChatGPT and how it should respond.

    Negative Logits
     gaman
    0.54
     Kala
    0.45
     pestic
    0.44
     semis
    0.44
     dimensione
    0.42
     உலகம்
    0.41
    kala
    0.41
     SaaS
    0.41
     sustancias
    0.41
     Nicholls
    0.41
    Positive Logits
    ered
    0.43
    illerato
    0.42
     Sensory
    0.40
    0.39
    ban
    0.38
     развитию
    0.37
    0.37
    amaged
    0.37
    าร
    0.37
    event
    0.37
    Act Density 16.574%

    No Known Activations