Cite this article: ZHU Wei, HONG Li-dong, SHI Hai-dong, HE De-feng. Deep reinforcement learning navigation algorithm combining advantage structure and minimum target Q-value [J]. Control Theory & Applications, 2024, 41(4): 716-728.
Deep reinforcement learning navigation algorithm combining advantage structure and minimum target Q-value
Received: 2022-04-19    Revised: 2023-11-11
DOI: 10.7641/CTA.2023.20293
2024, 41(4): 716-728
Keywords: reinforcement learning; mobile robot; navigation; advantage structure; minimum target Q-value
Funding: National Natural Science Foundation of China (62173303); Natural Science Foundation of Zhejiang Province (LY21F010009)
|
Abstract
Existing policy-gradient-based deep reinforcement learning methods suffer from long training times and low learning efficiency when applied to robot navigation in complex indoor scenes such as offices and corridors. This paper proposes a deep reinforcement learning navigation algorithm that combines an advantage structure with a minimum target Q-value. The algorithm introduces the advantage structure into policy-gradient-based deep reinforcement learning to distinguish differences between actions that share the same state value, thereby improving learning efficiency; in multi-target navigation scenarios, the state value is estimated separately, and map information is used to provide a more accurate value judgment. In addition, because methods that mitigate target Q-value overestimation in discrete control are difficult to apply within the mainstream actor-critic framework, a minimum target Q-value method based on Gaussian smoothing is designed to reduce the influence of overestimation on training. Experimental results show that the proposed algorithm effectively accelerates learning: in both single-target and multi-target continuous navigation training, it converges faster than the soft actor-critic (SAC), twin delayed deep deterministic policy gradient (TD3), and deep deterministic policy gradient (DDPG) algorithms, it keeps the mobile robot effectively away from obstacles, and the trained navigation model generalizes well.
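The abstract does not give the exact update rule, so the Python sketch below only illustrates the general idea of a Gaussian-smoothed minimum target Q-value in an actor-critic setting, in the spirit of target-action smoothing combined with taking the minimum over twin target critics. All names (actor_target, q1_target, q2_target) and hyperparameters (noise_std, noise_clip, gamma) are illustrative assumptions, not the paper's formulation.

import torch

@torch.no_grad()
def min_target_q(reward, done, next_state,
                 actor_target, q1_target, q2_target,
                 gamma=0.99, noise_std=0.2, noise_clip=0.5,
                 action_low=-1.0, action_high=1.0):
    """Illustrative sketch (not the paper's exact method): build a Bellman
    target from a Gaussian-smoothed target action and the minimum of two
    target critics, which curbs Q-value overestimation."""
    # Target action with clipped Gaussian smoothing noise.
    next_action = actor_target(next_state)
    noise = (torch.randn_like(next_action) * noise_std).clamp(-noise_clip, noise_clip)
    next_action = (next_action + noise).clamp(action_low, action_high)

    # Minimum over the two target critics limits overestimation.
    q_next = torch.min(q1_target(next_state, next_action),
                       q2_target(next_state, next_action))

    # Discounted Bellman target used to regress the online critics.
    return reward + gamma * (1.0 - done) * q_next

Under the same assumptions, the advantage structure mentioned in the abstract would amount to estimating the state value V(s) separately (for example, with the help of map information) and driving the actor update with A(s,a) = Q(s,a) - V(s) instead of the raw Q-value; this follows the standard advantage formulation and is not necessarily the paper's exact design.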
|
|
|
|
|