    Explanations

    Chinese health queries and JSON keywords

    Sentences where the assistant asserts it is a safe/helpful AI and refuses or explains why it cannot comply (safety/refusal boilerplate).

    Negative Logits
    Token           Value
    " pumpkin"      0.54
    " PACKAGE"      0.50
    "ERSHIP"        0.50
    "ាត់"            0.50
    (missing)       0.48
    " crispy"       0.48
    " avoidable"    0.48
    " ach"          0.46
    "inars"         0.46
    "اك"            0.46
    Positive Logits
    Token           Value
    "z"             0.57
    "ک"             0.57
    "gpt"           0.56
    "Convers"       0.55
    (missing)       0.54
    "María"         0.52
    "O"             0.52
    "Robot"         0.52
    "mig"           0.52
    "openai"        0.52
    Activation Density: 2.839%

    No Known Activations