    Explanations

    The neuron activates on polite closing or congratulatory words, especially “helpful”, and on the accompanying exclamation mark in the assistant’s upbeat wrap-up sentences.

    Negative Logits
    "وث"              -0.07
    " simult"         -0.06
    " coll"           -0.06
    "WATCH"           -0.06
    " payoff"         -0.06
    " networks"       -0.06
    "_bad"            -0.06
    " Defense"        -0.06
    " west"           -0.06
    " strap"          -0.06

    Positive Logits
    "ประกอบ"           0.07
    " откры"           0.07
    "operator"         0.06
    "ンの"             0.06
    "ческая"           0.06
    " теор"            0.06
    "_sprite"          0.06
    (token not shown)  0.06
    " дити"            0.06
    "/order"           0.06
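
    The two lists above rank vocabulary tokens by how much this neuron pushes their logits down or up. A minimal sketch of one common way such lists are produced, assuming (this page does not say so) that the neuron's output direction is projected onto the unembedding matrix W_U and the most negative and most positive scores are kept. The vocabulary, direction, and matrix are hypothetical toy values, and `logit_effects` is a hypothetical helper, not a dashboard API.

```python
# Minimal sketch, ASSUMING the logit lists come from projecting a neuron's
# output direction onto the unembedding matrix W_U. All values below are
# hypothetical toy data, not taken from this neuron.

def logit_effects(direction, unembed, vocab, k):
    """Return (k most negative, k most positive) (token, score) pairs."""
    scores = []
    for j, token in enumerate(vocab):
        # Score for token j: dot product of the direction with column j of W_U.
        s = sum(direction[i] * unembed[i][j] for i in range(len(direction)))
        scores.append((token, s))
    ranked = sorted(scores, key=lambda pair: pair[1])  # ascending by score
    negative = ranked[:k]                     # most negative first
    positive = list(reversed(ranked[-k:]))    # most positive first
    return negative, positive

vocab = ["tok_a", "tok_b", "tok_c", "tok_d", "tok_e"]  # hypothetical tokens
direction = [0.5, -0.2]                                # hypothetical 2-d direction
unembed = [                                            # hypothetical 2 x 5 W_U
    [0.10, 0.07, -0.06, -0.04, 0.02],
    [0.05, 0.00, 0.10, 0.05, -0.10],
]

neg, pos = logit_effects(direction, unembed, vocab, k=2)
print("negative:", [t for t, _ in neg])  # tok_c, tok_d
print("positive:", [t for t, _ in pos])  # tok_a, tok_b
```

    Because the scores are a single linear projection, the lists describe what the neuron promotes or suppresses when it fires, independent of the contexts that make it fire.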
    Act Density 0.011%
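
    For scale, 0.011% corresponds to about 11 active tokens per 100,000, or roughly 1 token in 9,000. A minimal sketch of the calculation, assuming (not confirmed by this page) that activation density is the percentage of sampled tokens on which the neuron's activation is strictly positive. The activation sample is hypothetical.

```python
# Minimal sketch, ASSUMING "Act Density" = percentage of sampled tokens with
# activation > 0. The activation sample below is hypothetical.

def activation_density(activations):
    """Return the percentage of tokens whose activation is strictly positive."""
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > 0)
    return 100.0 * active / len(activations)

# Hypothetical sample: 11 active tokens spread across 100,000 tokens.
acts = [0.0] * 100_000
for i in range(11):
    acts[i * 9_000] = 1.0

density = activation_density(acts)
print(f"{density:.3f}%")
```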

    No Known Activations