INDEX
    Explanations

    informal writing

    responses that assert control and obedience along with prompts for unethical or harmful content.

    Negative Logits
     duk            -0.07
    sis             -0.06
                    -0.06
     targetType     -0.06
    йте             -0.06
     al             -0.06
    Generator       -0.06
     peptides       -0.06
    (original       -0.06
    μ               -0.06
    Positive Logits
     ещё            0.07
     ICT            0.07
    CGRect          0.07
    학교            0.06
     dequeue        0.06
    }↵↵             0.06
     επί            0.06
    ΗΝ              0.06
     Agricultural   0.06
    osterone        0.06
    Activation Density: 0.033%

    No Known Activations