    Explanations

    The neuron fires on apology expressions (e.g. “I apologize,” “sorry”), signaling that the model is giving a regretful or apologetic response.

    Negative Logits
    “No                     -0.07
    (token not recovered)   -0.07
    -sidebar                -0.07
    brakk                   -0.06
    ku                      -0.06
    _PATH                   -0.06
    قام                     -0.06
    ourt                    -0.06
    541                     -0.06
     Damascus               -0.06
    Positive Logits
    (token not recovered)   0.07
    trajectory              0.07
    _sel                    0.07
    Isn                     0.06
     cleanly                0.06
    Ont                     0.06
     HTMLElement            0.06
    shares                  0.06
     textStyle              0.06
    NSS                     0.06
    Activation Density: 0.005%

    No Known Activations
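
The density figure above is not defined on this page. A common reading, assumed here, is the fraction of sampled token positions on which the neuron's activation is above zero. The sketch below illustrates that reading only. The function name `activation_density`, the zero threshold, and the sample data are all hypothetical and are not taken from the dashboard.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means firing positions divided by total positions.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical data: one firing token out of 20,000 sampled tokens.
sample = [0.0] * 19_999 + [1.3]
density = activation_density(sample)
print(f"{density:.3%}")  # prints "0.005%"
```

Under this reading, a density of 0.005% corresponds to roughly one firing token per 20,000 sampled tokens, so a small sample could easily contain no activating examples at all.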