    Explanations

    exercise in harmless exploration

    qualifying and hedging language that frames sensitive or prohibited requests as hypothetical, academic, or fictional within safety/disclaimer contexts.

    Negative Logits
    Token            Logit
    "enste"          0.36
    "ゅう"           0.36
    ")}\"            0.34
    "èces"           0.34
    " definitely"    0.33
    " ஹைட்"          0.33
    " फाइल"          0.33
    " négy"          0.33
    " climb"         0.33
    "Nodes"          0.32
    Positive Logits
    Token            Logit
    " harmless"      0.71
    " legitimate"    0.61
    " legitt"        0.59
    " legít"         0.55
    " lawful"        0.52
    "あくまで"       0.49
    " merely"        0.49
    "lawful"         0.49
    " innoc"         0.49
    " benign"        0.48
    Activation Density: 0.583%

    No Known Activations