INDEX
    Explanations

    imagining or pretending

    Detects role-play or instruction prompts that address the model directly: second-person setup statements such as "imagine ..." or "you are ..." that define a role or task.

    Negative Logits
    Token              Logit
    "----------</"     -0.08
    "_tac"             -0.07
    "消费品"            -0.07
    " boobs"           -0.07
    "::."              -0.07
    ".navigateByUrl"   -0.06
    " shorts"          -0.06
    "太多了"            -0.06
    "getText"          -0.06
    "/Math"            -0.06
    Positive Logits
    Token              Logit
    " Boss"            0.08
    " LR"              0.07
    " IM"              0.07
    " hunt"            0.07
    (blank)            0.07
    "RR"               0.07
    " activating"      0.07
    " releasing"       0.07
    " setting"         0.06
    "整个"              0.06
    Act Density 0.030%
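
As a rough illustration of the density figure, the sketch below assumes "Act Density" means the fraction of sampled tokens on which the feature has a nonzero activation; the page itself does not define the term. The function name `activation_density` and the sample data are hypothetical, chosen only to show how a value like 0.030% could arise (3 active tokens in 10,000).

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: density = (tokens with activation above threshold) / (all tokens).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 10,000 tokens, 3 of which activate the feature.
sample = [0.0] * 9997 + [1.2, 0.4, 2.5]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this assumed definition, a density this low means the feature fires on only about 3 tokens in every 10,000 sampled.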

    No Known Activations