    Explanations

    References to the model's identity, or the phrase "As a large language model" (self-referential model introductions).

    The neuron activates on self-referential disclaimer phrases in the style of “As a large language model”.

    Negative Logits
    " convirt"           0.42
    " sehingga"          0.40
    " wodurch"           0.40
    " بنابراین"          0.37
    (token not shown)    0.37
    "に示す"              0.37
    " випад"             0.37
    " تركيب"             0.36
    " ngunit"            0.36
    " ପ୍ର"                0.35
    Positive Logits
    " indexRouter"       0.43
    " étant"             0.42
    "我都"                0.40
    " being"             0.39
    " YouTuber"          0.39
    "having"             0.39
    " having"            0.38
    " itself"            0.38
    "我会"                0.38
    " lover"             0.37
    Act Density 0.069%
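
    The page does not define Act Density. A minimal sketch, assuming it means the percentage of sampled tokens on which the neuron's activation is above zero, could look like the following. The function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and chosen only to illustrate the arithmetic.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" is the share of tokens with activation above
    the threshold; the source page does not state its exact definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical sample: 69 active tokens out of 100,000.
sample = [1.0] * 69 + [0.0] * 99_931
density = activation_density(sample)
print(f"Act Density {density:.3f}%")
```

    Under this assumption, a density of 0.069% would correspond to roughly 69 active tokens per 100,000 sampled tokens.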

    No Known Activations