Average cost Markov decision processes with countable state spaces (in English)

ZHANG Jun-yu; WU Yi-ting; XIA Li; CAO Xi-Ren

Cite this article: ZHANG Jun-yu, WU Yi-ting, XIA Li, CAO Xi-Ren. Average cost Markov decision processes with countable state spaces[J]. Control Theory & Applications, 2021, 38(11): 1707-1716.

Received: 2021-08-20. Revised: 2021-11-16.

DOI: 10.7641/CTA.2021.10763

2021,38(11):1707-1716

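Keywords (translated from the Chinese): Markov decision process; average criterion; countable state space; Dynkin's formula; Poisson equation; performance sensitivity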

Keywords (English): Markov decision process; long-run average; countable state spaces; Dynkin's formula; Poisson equation; performance sensitivity

Funding: Supported by the National Natural Science Foundation of China (61673019, 61773411, 11931018, 62073346), the Guangdong Province Key Laboratory of Computational Science at Sun Yat-sen University (2020B1212060032), and the Guangdong Basic and Applied Basic Research Foundation (2021A1515010057, 2021A1515011984).

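Author	Affiliation	Postal code
ZHANG Jun-yu	Sun Yat-sen University	510275
WU Yi-ting	Sun Yat-sen University
XIA Li	Sun Yat-sen University	510275
CAO Xi-Ren^*	The Hong Kong University of Science and Technology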

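Abstract (translated from the Chinese)

For a Markov decision process (MDP) with a countable state space under the average criterion, an optimal (stationary) policy does not necessarily exist. This paper studies optimal policies that satisfy the optimality inequality in countable-state MDPs under the average criterion. In contrast to the vanishing discount (factor) approach, the main results are derived using the discrete Dynkin's formula. First, the Poisson equation for ergodic Markov chains and two examples of null recurrent Markov chains are given, and the existence of optimal policies satisfying two optimality inequalities with opposite directions is proved. Second, using two comparison lemmas and the performance difference formula, the existence of optimal policies is proved for positive recurrent chains and multi-chains, and the result is then extended to other cases. In particular, several application examples illustrate the performance-sensitive nature of the average criterion. The results complete the theory of optimality inequalities for countable-state MDPs under the average criterion.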

Abstract (English)

For the long-run average of a Markov decision process (MDP) with a countable state space, an optimal (stationary) policy may not exist. In this paper, we study optimal policies satisfying the optimality inequality in a countable-state MDP under the long-run average criterion. Unlike the vanishing discount approach, we use the discrete Dynkin's formula to derive the main results. We first provide the Poisson equation of an ergodic Markov chain and two instructive examples of null recurrent Markov chains, and demonstrate the existence of optimal policies for two optimality inequalities with opposite directions. Then, using two comparison lemmas and the performance difference formula, we prove the existence of optimal policies for positive recurrent chains and multi-chains, and further extend this result to other situations. In particular, several application examples are provided to illustrate the essence of the performance sensitivity of the long-run average. Our results supplement the literature on the optimality inequality of average-cost MDPs with countable state spaces.
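
For orientation, the three tools named in the abstract can be written out in their standard textbook form. This is only a sketch under assumed notation, not the paper's own statements: $P$ is the transition matrix of the chain under a fixed policy, $f$ the cost function, $\pi$ the stationary distribution, $\eta = \pi f$ the long-run average cost, $g$ the performance potential, $e$ the all-ones column vector, and primes mark the same quantities under a second policy. The integrability conditions needed on a countable state space (for example $\pi' |g| < \infty$) are assumed rather than verified.

```latex
% Assumed notation (not taken from the paper): P, f, \pi, \eta = \pi f, g, e.
\begin{align}
  % Poisson equation for an ergodic chain
  (I - P)\, g + \eta\, e &= f, \\
  % Discrete Dynkin's formula for a function h with E|h(X_n)| < \infty
  \mathbb{E}\bigl[h(X_N) \mid X_0 = i\bigr] - h(i)
    &= \mathbb{E}\Bigl[\sum_{n=0}^{N-1} (P h - h)(X_n) \Bigm| X_0 = i\Bigr], \\
  % Performance difference formula between policies (P', f') and (P, f)
  \eta' - \eta &= \pi' \bigl[(f' + P' g) - (f + P g)\bigr].
\end{align}
% Taking h = g in Dynkin's formula and using P g - g = \eta e - f gives
\begin{equation}
  \frac{1}{N}\, \mathbb{E}\Bigl[\sum_{n=0}^{N-1} f(X_n) \Bigm| X_0 = i\Bigr]
    = \eta + \frac{g(i) - \mathbb{E}\bigl[g(X_N) \mid X_0 = i\bigr]}{N}.
\end{equation}
```

The last identity suggests why Dynkin's formula is a natural substitute for the vanishing discount argument: the finite-horizon average differs from $\eta$ only by a term that vanishes whenever $\mathbb{E}[g(X_N) \mid X_0 = i]/N \to 0$.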