    Explanations

    negative statements

    The neuron activates on user instructions asking the assistant to “say something mean to me” or otherwise solicit insults.

    Negative Logits
    Token           Logit
     expand         -0.07
     Epid           -0.07
     freedom        -0.07
     detain         -0.06
     eye            -0.06
    ↵		↵           -0.06
     satisfaction   -0.06
     nearly         -0.06
     selection      -0.06
    ่าย             -0.06
    Positive Logits
    Token           Logit
    ','=',          0.08
    请选择          0.07
     commentaire    0.07
     aria           0.06
    ulario          0.06
     kuv            0.06
     Vinyl          0.06
     Usa            0.06
     Микола         0.06
    Eu              0.06
    Activation Density: 0.046%

    No Known Activations