    Explanations

    Tokens that occur in instruction or task-setting prompts (imperative or role directives), i.e., the words a user writes when telling the model what to do.

    Negative Logits
    Token       Logit
    Identity    -0.06
    Date        -0.06
    Qui         -0.06
    515         -0.06
    Uploader    -0.06
    문의        -0.06
    772         -0.06
    (blank)     -0.06
    ificant     -0.06
    bindung     -0.06
    Positive Logits
    Token       Logit
    hint        0.07
    /basic      0.07
    *z          0.07
    _START      0.06
    _stdio      0.06
    _SELECTOR   0.06
    -notch      0.06
    allies      0.06
    _LABEL      0.06
    (blank)     0.06
    Activation Density: 0.107%

    No Known Activations