INDEX
    Explanations

    Instances of the assistant's standard self-referential safety/disclaimer phrasing (e.g., "I'm sorry, but as an AI language model...").

    Negative Logits
    Token        Logit
    "(job"       -0.06
    [blank]      -0.06
    "(state"     -0.06
    " tạm"       -0.06
    " unite"     -0.06
    "(sample"    -0.06
    "P"          -0.06
    "/class"     -0.06
    " GFP"       -0.06
    " Рус"       -0.06
    Positive Logits
    Token           Logit
    " AI"           0.18
    "AI"            0.11
    "_AI"           0.08
    " Treatment"    0.08
    "Ai"            0.08
    "_build"        0.07
    "ertainty"      0.07
    ".AI"           0.07
    " Ethernet"     0.07
    ".ai"           0.07
    Activation Density: 0.026%

    No Known Activations