Explanations
The neuron flags the assistant’s self-limiting or refusal language: Portuguese tokens such as “não posso” (“I cannot”), “posso não” (“I may not”), and “não tenho” (“I do not have”), which express inability or refusal.
Negative Logits
肉            -0.07
_basename     -0.06
ilo           -0.06
elo           -0.06
�             -0.06
_coin         -0.06
içeren        -0.06
ไหน           -0.06
,name         -0.06
coatings      -0.06
Positive Logits
реж           0.07
signal        0.06
-value        0.06
Steam         0.06
(INFO         0.06
reflects      0.06
�             0.06
manent        0.06
_locations    0.06
structural    0.06
Activations Density 0.026%