Cite this article: ZHANG Jun-yu, WU Yi-ting, XIA Li, CAO Xi-Ren. Average cost Markov decision processes with countable state spaces (in English) [J]. Control Theory & Applications, 2021, 38(11): 1707-1716.
Average cost Markov decision processes with countable state spaces
Received: 2021-08-20  Revised: 2021-11-16
DOI: 10.7641/CTA.2021.10763
2021, 38(11): 1707-1716
Keywords: Markov decision process; long-run average; countable state spaces; Dynkin's formula; Poisson equation; performance sensitivity
Funding: Supported by the National Natural Science Foundation of China (61673019, 61773411, 11931018, 62073346), the Guangdong Province Key Laboratory of Computational Science at Sun Yat-sen University (2020B1212060032) and the Guangdong Basic and Applied Basic Research Foundation (2021A1515010057, 2021A1515011984).
Abstract
For a Markov decision process (MDP) with a countable state space under the long-run average criterion, an optimal
(stationary) policy may not exist. In this paper, we study optimal policies that satisfy the optimality inequality in a
countable-state MDP under the long-run average criterion. Unlike the vanishing discount approach, we use the discrete
Dynkin's formula to derive the main results. We first present the Poisson equation of an ergodic Markov chain and two
instructive examples of null recurrent Markov chains, and establish the existence of optimal policies for two optimality
inequalities with opposite directions. Then, using two comparison lemmas and the performance difference formula, we
prove the existence of optimal policies for positive recurrent chains and multi-chains, and extend the result to other
situations. In particular, several application examples illustrate the essence of the performance sensitivity of the
long-run average. Our results supplement the existing theory of the optimality inequality for average-cost MDPs with
countable state spaces.
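
The Poisson equation and the performance difference formula named in the abstract can be checked numerically on a small example. The sketch below is a finite-state illustration only and assumes the standard finite ergodic forms (I - P) g + eta e = f and eta' - eta = pi' [(f' + P' g) - (f + P g)]; the paper itself works with countable state spaces, where additional conditions are needed. The function names, the three-state transition matrices and the cost vectors are all made up for illustration and do not come from the paper.

```python
# Finite-state illustration only; the paper treats countable state spaces.
# All matrices and cost vectors below are made-up example data.

def mat_vec(P, v):
    """Return the column vector P v."""
    return [sum(p * x for p, x in zip(row, v)) for row in P]

def vec_mat(v, P):
    """Return the row vector v P."""
    n = len(P)
    return [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]

def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def stationary(P, iters=2000):
    """Approximate the stationary distribution pi by power iteration."""
    pi = [1.0 / len(P)] * len(P)
    for _ in range(iters):
        pi = vec_mat(pi, P)
    return pi

def average_and_potential(P, f, iters=2000):
    """Return (eta, g): eta = pi f and g = sum over n of P^n (f - eta e)."""
    pi = stationary(P, iters)
    eta = dot(pi, f)
    term = [c - eta for c in f]
    g = list(term)
    for _ in range(iters):
        term = mat_vec(P, term)
        g = [a + b for a, b in zip(g, term)]
    return eta, g

def poisson_residual(P, f):
    """Largest absolute entry of (I - P) g + eta e - f."""
    eta, g = average_and_potential(P, f)
    Pg = mat_vec(P, g)
    return max(abs(g[i] - Pg[i] + eta - f[i]) for i in range(len(f)))

def difference_formula_gap(P, f, P2, f2):
    """Absolute gap between eta' - eta and pi' [(f' + P' g) - (f + P g)]."""
    eta, g = average_and_potential(P, f)
    pi2 = stationary(P2)
    eta2 = dot(pi2, f2)
    Pg, P2g = mat_vec(P, g), mat_vec(P2, g)
    rhs = sum(pi2[i] * ((f2[i] + P2g[i]) - (f[i] + Pg[i]))
              for i in range(len(f)))
    return abs((eta2 - eta) - rhs)

# Two hypothetical policies on three states: (P, F) and (P_NEW, F_NEW).
P = [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2], [0.1, 0.4, 0.5]]
F = [1.0, 2.0, 4.0]
P_NEW = [[0.7, 0.2, 0.1], [0.3, 0.5, 0.2], [0.2, 0.2, 0.6]]
F_NEW = [1.5, 2.0, 3.0]

if __name__ == "__main__":
    print("Poisson residual:", poisson_residual(P, F))
    print("Difference formula gap:", difference_formula_gap(P, F, P_NEW, F_NEW))
```

Both printed quantities should be at the level of floating-point rounding error. The potential g is built here from the series of P^n (f - eta e), which converges for a finite ergodic aperiodic chain; this is a convenience of the sketch, not the method used in the paper.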