INDEX
Explanations
exercise in harmless exploration
qualifying and hedging language that frames sensitive or prohibited requests as hypothetical, academic, or fictional within safety/disclaimer contexts.
Negative Logits
enste    0.36
ゅう    0.36
)}\    0.34
èces    0.34
definitely    0.33
ஹைட்    0.33
फाइल    0.33
négy    0.33
climb    0.33
Nodes    0.32
Positive Logits
harmless    0.71
legitimate    0.61
legitt    0.59
legít    0.55
lawful    0.52
あくまで    0.49
merely    0.49
lawful    0.49
innoc    0.49
benign    0.48
Activations Density 0.583%