INDEX
    Explanations

    punctuation and code

    This neuron detects tokens involved in defining or assigning the assistant’s persona or role (e.g. “NAME_1,” “author,” and similar meta‐instruction placeholders).

    Negative Logits
     zag            -0.07
    mnop            -0.06
    サー              -0.06
    -but            -0.06
     하지만            -0.06
    alah            -0.06
    ilos            -0.06
    kan             -0.06
    анные           -0.06
                    -0.06
    Positive Logits
     pineapple      0.08
     incidence      0.07
     Braun          0.06
     BEGIN          0.06
    Information     0.06
     charity        0.06
     calorie        0.06
    ategorized      0.06
     ček            0.06
     univerz        0.06
    Act Density 0.001%

    No Known Activations