INDEX
    Explanations

    The neuron is detecting gerund verbs (–ing action words) that describe potentially harmful or disallowed actions in user instructions.

    New Auto-Interp
    Negative Logits
    getPosition
    -0.07
     Sets
    -0.07
     sets
    -0.07
     SET
    -0.07
     DATA
    -0.07
    able
    -0.07
     dogs
    -0.07
    olicited
    -0.06
     CF
    -0.06
    ainty
    -0.06
    POSITIVE LOGITS
    recipient
    0.07
    pie
    0.07
     apl
    0.07
     Geographic
    0.07
    Twitter
    0.07
     jede
    0.06
    virt
    0.06
     decrement
    0.06
     kl
    0.06
    /DTD
    0.06
    Act Density 0.015%

    No Known Activations