    Explanations

    language model identity

    Self-referential AI meta-discussion: passages where the assistant describes its identity as a language model, along with its capabilities, limitations, safety policies, training, and operational context.

    NEGATIVE LOGITS
     foolproof    0.80
    ampionship    0.69
    Tort    0.66
     fateful    0.65
    Chọn    0.63
    Locks    0.63
    Arrow    0.63
    Trap    0.62
    发生在    0.62
    (no token text)    0.62
    POSITIVE LOGITS
     chatbot    0.90
     agréable    0.82
     informatique    0.81
     openai    0.80
     язы    0.80
     OpenAI    0.80
     logiciels    0.79
     numérique    0.78
     jazy    0.78
     chatbots    0.78
    Act Density 0.115%

    No Known Activations