    Explanations

    Code and text snippets

    The neuron fires on tokens in the policy/instruction header, for example “history,” “insult,” “competitive,” and “innuendos.” In other words, it detects system-level instruction or policy text rather than user content.

    Negative Logits
        Logit   Token
        -0.07   " harvest"
        -0.06   (blank)
        -0.06   "İN"
        -0.06   "міністра"
        -0.06   " лиц"
        -0.06   " закін"
        -0.06   "以上"
        -0.06   (blank)
        -0.06   "HAVE"
        -0.06   "٥"
    Positive Logits
        Logit   Token
        0.07    ".mouse"
        0.06    "看到"
        0.06    " Mec"
        0.06    " twisting"
        0.06    " rv"
        0.06    ".Hour"
        0.06    " Sebastian"
        0.06    " entrev"
        0.06    "прав"
        0.06    " cott"
    Activation Density: 0.024%
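
The density figure above is usually read as the fraction of sampled tokens on which the neuron is active. The page does not say how it computes the number, so the sketch below is an assumption: it counts a token as active when its activation is strictly above a threshold of zero. The function name `activation_density` and the sample data are hypothetical and only illustrate the arithmetic.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "active" means strictly greater than `threshold`;
    the source page does not state its exact definition.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 3 active tokens out of 12,500.
sample = [0.0] * 12_497 + [0.8, 1.3, 0.2]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.024%
```

At a density this low, roughly one token in four thousand would activate the neuron, so a small text sample can easily contain no activating tokens at all.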

    No Known Activations