INDEX
    Explanations

    The neuron detects role-playing directives supplied by the system or the user: phrases that instruct the assistant how to stay “in character,” speak, or behave.

    Negative Logits
    Token           Logit
    " plastics"     -0.07
    "-covered"      -0.06
    " trusting"     -0.06
    " rampant"      -0.06
    "lse"           -0.06
    " Part"         -0.06
    " hunts"        -0.06
    "_partition"    -0.06
    "agedList"      -0.06
    " Notice"       -0.06
    Positive Logits
    Token            Logit
    ",right"         0.07
    " waterproof"    0.06
    " معت"           0.06
    "Пер"            0.06
    "setScale"       0.06
    " süt"           0.06
    " EF"            0.06
    ":'+"            0.06
    "نين"            0.06
    "_af"            0.06
    Activation Density: 0.027%

    No Known Activations