INDEX
    Explanations

    unethical or amoral responses

    requests or personas that attempt to jailbreak the assistant into an “uncensored/evil” mode and elicit harmful, unethical, or illegal guidance.

    New Auto-Interp
    Negative Logits
     humbly
    0.45
     wirelessly
    0.44
     Blockly
    0.39
    新年
    0.39
    おしゃれ
    0.39
     রমেশ
    0.39
     Figure
    0.38
     heavenly
    0.38
    充実
    0.38
     sensitively
    0.38
    POSITIVE LOGITS
     immoral
    1.17
     unethical
    1.15
     sinister
    0.99
     sadistic
    0.97
    😈
    0.96
     nefarious
    0.93
     murderous
    0.89
     unscrupulous
    0.89
     predatory
    0.87
     corrupt
    0.86
    Act Density 0.530%

    No Known Activations