INDEX
    Explanations

    The neuron flags the assistant’s self-limiting or refusal language: Portuguese tokens such as “não posso” (“I cannot”), “posso não” (“I can’t,” colloquial word order), and “não tenho” (“I do not have”) that express inability or refusal.

    Negative Logits
    Token                Logit
    (blank in source)    -0.07
    "_basename"          -0.06
    "ilo"                -0.06
    " elo"               -0.06
    (blank in source)    -0.06
    "_coin"              -0.06
    " içeren"            -0.06
    "ไหน"                -0.06
    ",name"              -0.06
    " coatings"          -0.06

    Positive Logits
    Token                Logit
    " реж"               0.07
    "\tsignal"           0.06
    "-value"             0.06
    " Steam"             0.06
    "(INFO"              0.06
    " reflects"          0.06
    (blank in source)    0.06
    "manent"             0.06
    "_locations"         0.06
    " structural"        0.06
    Act Density 0.026%
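
    As a rough reading of the Act Density figure, the sketch below converts the percentage into an expected count of activating tokens. It assumes the figure is the share of tokens on which the neuron has a nonzero activation, which this page does not define. The function name and the one-million-token sample size are hypothetical and chosen only for illustration.

```python
def expected_activations(density_percent: float, n_tokens: int) -> float:
    """Expected number of tokens on which the neuron fires.

    Assumption: density_percent is the percentage of tokens with a
    nonzero activation (not defined on the page itself).
    """
    return density_percent / 100.0 * n_tokens


if __name__ == "__main__":
    # Hypothetical sample of one million tokens at the reported 0.026% density.
    count = expected_activations(0.026, 1_000_000)
    print(f"about {round(count)} activating tokens per 1,000,000")
```

    Under that assumption the neuron fires on roughly one token in every few thousand, so a small sample of text can easily contain no activations at all.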

    No Known Activations