Policy gradient estimation for continuous-time partially observable Markov decision processes

TANG Bo; LI Yan-jie; YIN Bao-qun

Cite this article:	TANG Bo, LI Yan-jie, YIN Bao-qun. The policy gradient estimation for continuous-time partially observable Markovian decision processes[J]. Control Theory & Applications, 2009, 26(7): 805-808.

The policy gradient estimation for continuous-time partially observable Markovian decision processes

Abstract views: 2293 Full-text views: 1495 Received: 2008-03-26 Revised: 2008-08-30

DOI: 10.7641/j.issn.1000-8152.2009.7.CCTA080248

2009,26(7):805-808

Keywords: continuous-time partially observable Markov decision process (CTPOMDP); policy gradient estimation; uniformization (given as "conformity" in the journal's English keywords); error bound

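Funding: Supported by the National Natural Science Foundation of China (60574065); the National "863" Program of China (2006AA01Z114); and the Seed Fund of the Joint Laboratory of Intelligent Science and Technology of the Institute of Automation, Chinese Academy of Sciences, and the University of Science and Technology of China (JL0606).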

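Author	Affiliation	E-mail
TANG Bo*	Department of Automation, University of Science and Technology of China	ttb96620@mail.ustc.edu.cn
LI Yan-jie	Department of Automation, University of Science and Technology of China
YIN Bao-qun	Department of Automation, University of Science and Technology of China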

Abstract

This paper presents a policy gradient estimation method for the performance optimization of continuous-time partially observable Markov decision processes (CTPOMDPs). Using the uniformization method, the gradient estimation algorithm for discrete-time partially observable Markov decision processes (DTPOMDPs) is extended to the continuous-time model. The convergence and the error bound of the algorithm are analyzed, and a numerical example illustrates its application.
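
As a rough illustration of the continuous-to-discrete step, the standard uniformization construction replaces a generator matrix Q with the transition matrix P = I + Q/λ, for any rate λ at least as large as every exit rate. The sketch below is not the paper's algorithm: the 3-state matrix `Q`, the rate `lam`, and the function names `uniformize` and `stationary` are hypothetical, chosen only to show that P is a valid stochastic matrix and shares the stationary distribution of Q.

```python
def uniformize(Q, lam=None):
    """Return (P, lam) with P = I + Q / lam, the uniformized chain of generator Q."""
    n = len(Q)
    max_rate = max(-Q[i][i] for i in range(n))
    if lam is None:
        lam = max_rate  # smallest admissible uniformization rate
    if lam <= 0 or lam < max_rate:
        raise ValueError("lam must be positive and >= the largest exit rate")
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / lam for j in range(n)]
         for i in range(n)]
    return P, lam


def stationary(P, iters=10_000):
    """Approximate the stationary distribution of P by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


# Hypothetical generator: off-diagonal entries are transition rates,
# and each row sums to zero.
Q = [[-3.0, 2.0, 1.0],
     [1.0, -1.5, 0.5],
     [2.0, 2.0, -4.0]]

# lam = 5 exceeds the largest exit rate (4), so P has a positive diagonal.
P, lam = uniformize(Q, lam=5.0)
pi = stationary(P)

# pi P = pi implies pi Q = lam * (pi P - pi) = 0, so pi is also stationary for Q.
residual = [sum(pi[i] * Q[i][j] for i in range(3)) for j in range(3)]
```

Choosing `lam` strictly above the largest exit rate keeps every diagonal entry of P positive, which makes the discrete-time chain aperiodic and lets the simple power iteration converge.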