INDEX
    Explanations

    Avoiding spam or unnecessary content

    Content related to sexual themes and adult situations

    This neuron activates on words from instructions in the system guidelines (e.g., "explain," "repeat," "output," "irrelevant," "yourself," "answers"), effectively detecting the parts of the policy that tell the model not to perform certain behaviors.

    Negative Logits
    fine                    -0.06
    _master                 -0.06
     nomine                 -0.06
    Pass                    -0.06
     ICON                   -0.06
    мін                     -0.06
     clips                  -0.06
     nine                   -0.06
     stitch                 -0.06
     intr                   -0.06
    Positive Logits
    _ComCallableWrapper     0.07
     gerek                  0.07
    \E                      0.07
    /|                      0.06
    action                  0.06
    codec                   0.06
    ократи                  0.06
    (Api                    0.06
    ็ว                      0.06
     рань                   0.06
    Activation Density: 0.001%

    No Known Activations