    Explanations

    varied text sources

    The neuron strongly fires on the assistant’s stock self-description “As an AI language model,” essentially detecting that exact self-referential phrase.

    Negative Logits
    Token                   Logit
    " transparent"          -0.07
    "prefer"                -0.07
    "chod"                  -0.06
    "zbek"                  -0.06
    "كات"                   -0.06
    "بان"                   -0.06
    ".layoutControlItem"    -0.06
    " Translator"           -0.06
    " pře"                  -0.06
    "Express"               -0.06
    Positive Logits
    Token                   Logit
    " Amendment"            0.08
    "-range"                0.07
    " внес"                 0.07
    " прям"                 0.06
    "mektedir"              0.06
    " OBJECT"               0.06
    "раниц"                 0.06
    " memorable"            0.06
    "ươ"                    0.06
    " проблема"             0.06
    Activation Density: 0.010%

    No Known Activations