INDEX
    Explanations

    The neuron fires on tokens where the assistant refuses a request or expresses inability (e.g. “I’m sorry,” “cannot,” “unable,” “decline”); in short, it detects refusal-style language.

    Negative Logits
     guarante   -0.06
    tl          -0.06
    .drive      -0.06
    ru          -0.06
    cerr        -0.06
    ")↵         -0.06
    antity      -0.06
    .RIGHT      -0.06
    it          -0.06
    print       -0.06
    Positive Logits
     Александ   0.07
     Nicholas   0.07
    /forum      0.07
     관리자        0.06
     Sek        0.06
    larla       0.06
     Серг       0.06
     sek        0.06
                0.06
    .wik        0.06
    Act Density 0.014%
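
The page does not define Act Density. A common reading, assumed here, is the fraction of tokens in a reference dataset on which the neuron's activation is above zero. Under that assumption, a minimal sketch of the calculation might look like the following; the function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and chosen only to reproduce the 0.014% figure.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    `activations` is a sequence of per-token activation values for one
    neuron (hypothetical input; the page does not show the raw data).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 100,000 tokens, of which 14 activate the neuron.
sample = [0.0] * 99_986 + [0.5] * 14
density = activation_density(sample)
print(f"Act Density {density:.3%}")
```

A density this low means the neuron is active on roughly 14 of every 100,000 tokens, which would be consistent with a feature that responds only to a narrow class of inputs.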

    No Known Activations