Cite this article: ZHU Wei, QIAO Xian-feng, CHEN Yi-kai, HE De-feng. Double time-series Q network algorithm with multi-step accumulation reward[J]. Control Theory and Technology, 2022, 39(2): 222-230.
|
Double time-series Q network algorithm with multi-step accumulation reward
Received: 2021-01-21    Revised: 2021-11-24
DOI: 10.7641/CTA.2021.10077
2022, 39(2): 222-230
Keywords: deep reinforcement learning; unmanned vehicles; multi-step accumulation reward; time-series network; data utilization
Funding: Supported by the Zhejiang Provincial Natural Science Foundation (LY21F010009), the National Natural Science Foundation of China (61773345), and the Open Fund of the State Key Laboratory of Automotive Simulation and Control (20171103).
|
Abstract
Vehicle driving control decision-making is a core technology of unmanned driving. Existing deep-reinforcement-learning-based control decision-making algorithms for unmanned driving suffer from low data-processing efficiency and an inability to effectively extract temporal features between states. This paper therefore proposes a double time-series Q network algorithm based on a multi-step accumulated reward. First, a multi-step accumulated reward method is designed: it averages the cumulative sum of future multi-step instant rewards and combines this average with the current instant reward to shape the agent's control policy, while the reward function emphasizes the dominant influence of the current instant reward. Then, a time-series network structure combining a long short-term memory network with a convolutional neural network is designed to enhance the agent's ability to capture temporal features between data frames. Experimental results show that the time-series network and the multi-step accumulated reward method improve the agent's convergence speed: after adding the time-series network to DQN and DDQN, their convergence speeds increase by 21.9% and 26.8%, respectively. Compared with DDQN and TD3, the control scores of the proposed algorithm in the typical Town01 and Town02 scenes of the Carla simulation platform are higher by 36.1% and 24.6%, respectively. In addition, in the complex Town03 scene, the proposed algorithm shows better generalization across different routes. These results indicate that the proposed algorithm effectively improves data utilization efficiency and exhibits good control and generalization ability.
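As a rough illustration of the multi-step accumulated reward described in the abstract (a minimal sketch, not the authors' implementation), the shaped reward could combine the current instant reward with the mean of the accumulated future instant rewards. The weighting coefficient `alpha` and the function name are hypothetical; the paper only states that the current reward's influence is kept dominant:

```python
def multi_step_reward(current_reward, future_rewards, alpha=0.8):
    """Sketch of a multi-step accumulated reward.

    Averages the cumulative sum of future instant rewards and blends
    it with the current instant reward. alpha > 0.5 keeps the current
    reward dominant (alpha is an assumed weighting, not from the paper).
    """
    if not future_rewards:
        return current_reward
    future_mean = sum(future_rewards) / len(future_rewards)
    return alpha * current_reward + (1 - alpha) * future_mean
```

For example, with a current reward of 1.0 and three future rewards of 0.5 each, the shaped reward is 0.8 * 1.0 + 0.2 * 0.5 = 0.9, so the current step still dominates.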
|
|
|
|
|