    Explanations

    information, instructions, tools

    Assistant refusal or safety-policy language: responses that decline to provide information in reply to illegal, dangerous, or inappropriate requests.

    Negative Logits
    hea
    -0.07
    headed
    -0.07
    ARED
    -0.06
     Guill
    -0.06
    rijk
    -0.06
    _lvl
    -0.06
    уру
    -0.06
    ared
    -0.06
    -0.06
     правил
    -0.06
    Positive Logits
    Cb
    0.07
     UN
    0.06
    (rot
    0.06
     SJ
    0.06
     expanding
    0.06
     stem
    0.06
     picked
    0.06
     SU
    0.06
    Broker
    0.06
     muscle
    0.06
    Activation Density: 0.034%

    No Known Activations