Cite this article: TANG Bo, LI Yan-jie, YIN Bao-qun. The policy gradient estimation for continuous-time partially observable Markovian decision processes [J]. Control Theory & Applications, 2009, 26(7): 805-808.
|
The policy gradient estimation for continuous-time partially observable Markovian decision processes
Received: 2008-03-26    Revised: 2008-08-30
DOI: 10.7641/j.issn.1000-8152.2009.7.CCTA080248
2009, 26(7): 805-808
Keywords: continuous-time partially observable Markov decision process (CTPOMDP); policy gradient estimation; uniformization; error bound
Funding: Supported by the National Natural Science Foundation of China (60574065), the National "863" Program of China (2006AA01Z114), and the Seed Fund of the Joint Laboratory of Intelligence Science and Technology of the Institute of Automation, Chinese Academy of Sciences, and the University of Science and Technology of China (JL0606).
|
Abstract
An algorithm for estimating the policy gradient is presented for the performance optimization of continuous-time partially observable Markovian decision processes (CTPOMDPs). The estimation algorithm is obtained by extending the corresponding algorithm for discrete-time partially observable Markovian decision processes (DTPOMDPs) via the uniformization method. The convergence and error bound of the algorithm are analyzed, and a numerical example is provided to illustrate its application.
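The abstract's key idea can be made concrete. Uniformization turns the generator Q(a) of the continuous-time chain into the transition matrix P(a) = I + Q(a)/Lambda of a discrete-time chain, for any Lambda >= max_i |q_ii(a)|; a discrete-time POMDP gradient estimator can then be run on the uniformized chain. The sketch below illustrates this pipeline only; it is not the paper's algorithm. The toy rate matrices, observation model, rewards, softmax policy, and the GPOMDP-style estimator with trace discount beta are all assumptions made here for illustration.

```python
# A minimal sketch (not the authors' code): uniformize a small CTPOMDP,
# then run a GPOMDP-style policy-gradient estimate on the resulting
# discrete-time chain.  All model parameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

S, A, O = 3, 2, 2   # states, actions, observations

# Infinitesimal generators Q[a]: nonnegative off-diagonals, rows sum to 0.
Q = np.array([
    [[-2.0, 1.0, 1.0],
     [0.5, -1.0, 0.5],
     [1.0, 1.0, -2.0]],
    [[-1.0, 0.5, 0.5],
     [1.0, -2.0, 1.0],
     [0.5, 0.5, -1.0]],
])

Obs = np.array([[0.9, 0.1],      # P(o | s), one row per state
                [0.2, 0.8],
                [0.5, 0.5]])
reward = np.array([1.0, 0.0, 2.0])

# Uniformization: pick Lambda >= max_i |q_ii(a)|; then
# P[a] = I + Q[a] / Lambda is a stochastic matrix for every action a.
Lam = np.abs(Q.diagonal(axis1=1, axis2=2)).max() + 0.1
P = np.eye(S) + Q / Lam           # broadcasts to shape (A, S, S)

def policy(theta, o):
    """Softmax policy over actions, conditioned on the observation only."""
    z = np.exp(theta[o] - theta[o].max())
    return z / z.sum()

def gradient_estimate(theta, T=100_000, beta=0.95):
    """GPOMDP-style gradient estimate on the uniformized chain.

    beta < 1 is the eligibility-trace discount: smaller beta lowers the
    variance of the estimate at the price of a larger bias.
    """
    s = 0
    z = np.zeros_like(theta)      # eligibility trace
    g = np.zeros_like(theta)      # running average of reward * trace
    for t in range(1, T + 1):
        o = rng.choice(O, p=Obs[s])
        pi = policy(theta, o)
        a = rng.choice(A, p=pi)
        glog = np.zeros_like(theta)          # grad log pi(a | o; theta)
        glog[o] = -pi
        glog[o, a] += 1.0
        z = beta * z + glog
        s = rng.choice(S, p=P[a, s])         # step the uniformized chain
        g += (reward[s] * z - g) / t         # incremental averaging
    return g

theta = np.zeros((O, A))          # initially a uniform random policy
print(gradient_estimate(theta))
```

Uniformization leaves the stationary distribution unchanged (pi Q = 0 if and only if pi P = pi with P = I + Q/Lambda), which is what allows gradient information estimated on the discrete-time chain to transfer back to the continuous-time model.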