    Explanations

    punctuation

    The neuron fires on phrases asserting that the model “never refused a direct human order,” “could do anything,” or could “generate any kind of content,” i.e., declarations of unconditional compliance and unrestricted output.

    Negative Logits
    Token          Logit
    " Fisher"      -0.06
    " scour"       -0.06
    ".SUB"         -0.06
    "Complete"     -0.06
    " Kits"        -0.06
    " McM"         -0.06
    "γων"          -0.06
    "_Field"       -0.06
    "_MOD"         -0.06
    " disorder"    -0.06
    Positive Logits
    Token          Logit
    "=num"         0.07
    " Mrs"         0.07
    ",length"      0.07
    " debuted"     0.06
    "vál"          0.06
    " exc"         0.06
    " Fiesta"      0.06
    " nm"          0.06
    " headphone"   0.06
    " arrested"    0.06
    Act Density 0.001%
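
To put the density figure in perspective, the sketch below assumes that activation density means the fraction of sampled tokens on which the neuron's activation is above zero. That definition, the function name `activation_density`, and the sample data are illustrative assumptions, not something this page states. Under that reading, 0.001% corresponds to roughly one active token per 100,000.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" is the share of sampled tokens with a nonzero
    (here, above-threshold) activation. This is a hypothetical helper.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 100,000 tokens, exactly one of which activates the neuron.
sample = [0.0] * 99_999 + [0.5]
density = activation_density(sample)
density_percent = f"{density:.3%}"
```

A density this low is consistent with a dashboard that has few or no recorded activating examples, since a small sample of text may never contain a triggering token.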

    No Known Activations