    Explanations

    AI self-description using "but" and "I"

    Sentences or passages where the assistant introduces itself or describes its identity, training, capabilities, and availability.

    The neuron flags the assistant's own long-form explanation turns (multi-paragraph, bullet-list responses) rather than user utterances. In other words, it activates on tokens in the model's detailed breakdowns.

    Negative Logits
    "我們要"  0.51
    "!!!!!!!!!!!!!!!!"  0.49
    "Fuck"  0.47
    " vimos"  0.46
    " нам"  0.46
    " fucking"  0.45
    "!!!!!!!!"  0.45
    " consiglio"  0.45
    " imo"  0.44
    " veamos"  0.44
    Positive Logits
    " Developers"  0.63
    " Capabilities"  0.63
    "Unlike"  0.60
    " OpenAI"  0.59
    "Developers"  0.59
    " myself"  0.58
    " developers"  0.58
    "Capabilities"  0.57
    " capabilities"  0.56
    " Unlike"  0.56
    Act Density 0.462%

    No Known Activations