    Explanations

    This neuron detects when the text is asking the model to assume or play a “role” (i.e. explicit role-playing instructions).

    Negative Logits
     Kết             -0.06
    ECT              -0.06
    的に             -0.06
    minute           -0.06
     Dickinson       -0.06
     Leopard         -0.06
     Bytes           -0.06
     نوشته           -0.06
     interruption    -0.06
     NPR             -0.06
    Positive Logits
    (assert          0.06
    ()==             0.06
    ín               0.06
    oppable          0.06
    =*               0.06
     wasn            0.06
    ASP              0.06
    =l               0.06
    ==(              0.06
    .setResult       0.06
    Activation Density: 0.017%

    No Known Activations