Cite this article: ZHU Wei, QIAO Xian-feng, CHEN Yi-kai, HE De-feng. Double time-series Q network algorithm with multi-step accumulation reward[J]. Control Theory and Technology, 2022, 39(2): 222-230.
|
Double time-series Q network algorithm with multi-step accumulation reward
Received: 2021-01-21    Revised: 2021-11-24
DOI: 10.7641/CTA.2021.10077
2022, 39(2): 222-230
Keywords: deep reinforcement learning; unmanned vehicles; multi-step accumulation reward; time-series network; data utilization
Funding: Supported by the Zhejiang Provincial Natural Science Foundation (LY21F010009), the National Natural Science Foundation of China (61773345), and the Open Fund of the State Key Laboratory of Automotive Simulation and Control (20171103).
|
Abstract
Vehicle driving control decision-making is a core technology of unmanned driving. Existing deep-reinforcement-learning-based control decision-making algorithms for unmanned driving suffer from low data-processing efficiency and an inability to effectively extract temporal features between states. This paper therefore proposes a double time-series Q network algorithm based on a multi-step accumulated reward. First, a multi-step accumulated reward method is designed: it averages the cumulative sum of future multi-step instant rewards and combines this average with the current instant reward to shape the agent's control policy, while the reward function emphasizes the dominant influence of the current instant reward. Then, a time-series network structure combining a long short-term memory network with a convolutional neural network is designed to enhance the agent's ability to capture temporal features between data frames. Experimental results show that the time-series network and the multi-step accumulated reward method improve the agent's convergence speed: after adding the time-series network to DQN and DDQN, their convergence speeds increase by 21.9% and 26.8%, respectively. Compared with DDQN and TD3, the control scores of the proposed algorithm in the typical Town01 and Town02 scenes of the Carla simulation platform are higher by 36.1% and 24.6%, respectively. In addition, in the complex Town03 scene, the proposed algorithm shows better generalization across different routes. These results indicate that the proposed algorithm effectively improves data utilization efficiency and exhibits good control and generalization ability.
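As a rough illustration of the multi-step accumulated reward described in the abstract (a minimal sketch, not the authors' implementation), the shaped reward could combine the current instant reward with the mean of the accumulated future instant rewards. The weighting coefficient `alpha` and the function name are hypothetical; the paper only states that the current reward's influence is kept dominant:

```python
def multi_step_reward(current_reward, future_rewards, alpha=0.8):
    """Sketch of a multi-step accumulated reward.

    Averages the cumulative sum of future instant rewards and blends
    it with the current instant reward. alpha > 0.5 keeps the current
    reward dominant (alpha is an assumed weighting, not from the paper).
    """
    if not future_rewards:
        return current_reward
    future_mean = sum(future_rewards) / len(future_rewards)
    return alpha * current_reward + (1 - alpha) * future_mean
```

For example, with a current reward of 1.0 and three future rewards of 0.5 each, the shaped reward is 0.8 * 1.0 + 0.2 * 0.5 = 0.9, so the current step still dominates.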
|
|
|
|
|