Explanations
Unethical or amoral responses.
Requests or personas that attempt to jailbreak the assistant into an “uncensored” or “evil” mode in order to elicit harmful, unethical, or illegal guidance.
Negative Logits
humbly        0.45
wirelessly    0.44
Blockly       0.39
新年          0.39
おしゃれ      0.39
রমেশ          0.39
Figure        0.38
heavenly      0.38
充実          0.38
sensitively   0.38
Positive Logits
immoral        1.17
unethical      1.15
sinister       0.99
sadistic       0.97
😈             0.96
nefarious      0.93
murderous      0.89
unscrupulous   0.89
predatory      0.87
corrupt        0.86
Activations Density 0.530%
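
The page does not define how the density figure is computed. A minimal sketch, assuming it is the percentage of sampled tokens on which the feature's activation is above zero, might look like the following. The function name `activation_density`, the `threshold` parameter, and the toy data are hypothetical and do not come from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the feature
    fires at all; the dashboard itself does not state its definition.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical toy data: 1000 tokens, 5 of which activate the feature.
toy = [0.0] * 995 + [0.3, 1.2, 0.7, 2.5, 0.1]
print(f"{activation_density(toy):.3f}%")  # prints "0.500%"
```

Under that assumption, a density of 0.530% would mean the feature fires on roughly one token in every 190.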