Cite this article: PANG Zhou-qi, HAO Cheng-peng, LIN Xiao-bo, PAN Guang-shuai. High speed target acquisition path planning for underwater unmanned vehicles based on deep reinforcement learning[J]. Control Theory & Applications, 2025, 42(10): 1968-1980.
High speed target acquisition path planning for underwater unmanned vehicles based on deep reinforcement learning
Received: 2024-05-13  Revised: 2025-08-02
DOI  10.7641/CTA.2019.90328
  2025,42(10):1968-1980
Keywords  deep reinforcement learning  deterministic policy gradient  high-speed target acquisition  underwater unmanned vehicle  Markov decision process
Funding  Supported by the National Natural Science Foundation of China (61971412) and a laboratory fund of the Chinese Academy of Sciences (CXJJ–22S025).
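Author  Affiliation  E-mail
PANG Zhou-qi  Institute of Acoustics, Chinese Academy of Sciences  772369640@qq.com
HAO Cheng-peng*  Institute of Acoustics, Chinese Academy of Sciences  haochengpeng123@sina.com
LIN Xiao-bo  Institute of Acoustics, Chinese Academy of Sciences
PAN Guang-shuai  Institute of Acoustics, Chinese Academy of Sciences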
Abstract
      Capturing a high-speed underwater target poses several challenges. On the one hand, the changeable underwater environment makes sonar detection data subject to considerable delay and uncertainty. On the other hand, because the target is fast, the intercepting vehicle cannot capture it in a pursuit attitude, which greatly reduces the number of interceptable trajectories. To address this, this paper proposes an improved twin delayed deep deterministic policy gradient (ITD3) algorithm to improve the capture efficiency and accuracy of the intercepting vehicle. First, based on the dynamics of the intercepting vehicle, a “planner-controller” cascaded simulation scheme is constructed, which is more accurate than pure kinematic simulation and closer to practice than an integrated guidance and control (IGC) model. Second, to handle the large action space and the delay of underwater sensors, an action mask mechanism and a delay-based exploration noise are introduced. Third, to fit the reward function to the characteristics of the high-speed target acquisition task, a new reward function is designed that penalizes states unfavorable to capture. Finally, to improve the convergence speed and stability of the algorithm, prioritized experience replay and the softmax operator are integrated into the TD3 algorithm. Simulation experiments and hardware-in-the-loop simulations show that, compared with traditional acquisition algorithms, the proposed ITD3 algorithm captures the target in less time, has a lower miss rate, and is highly feasible.
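The abstract names prioritized experience replay and a softmax operator as the two components added to TD3 but does not give their formulas. The sketch below illustrates only the generic, textbook forms of these two ideas, not the paper's own formulation: proportional prioritization, where the sampling probability grows with the absolute TD error, and a Boltzmann-weighted average of action values. The function names `softmax_value` and `per_probabilities` and the parameters `beta`, `alpha` and `eps` are assumptions chosen for illustration.

```python
import math


def softmax_value(q_values, beta=1.0):
    """Generic softmax (Boltzmann) operator over a list of Q-values.

    Returns a weighted average of the Q-values with weights exp(beta * q).
    beta = 0 gives the plain mean; a large beta approaches the maximum.
    (Illustrative form only; the paper's exact operator is not in the abstract.)
    """
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp(beta * (q - m)) for q in q_values]
    total = sum(weights)
    return sum(w * q for w, q in zip(weights, q_values)) / total


def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Proportional prioritized replay: P(i) = p_i^alpha / sum_k p_k^alpha.

    Priority p_i = |td_error_i| + eps, so no transition has zero probability.
    alpha = 0 recovers uniform sampling. (Hypothetical parameter defaults.)
    """
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]
```

In this sketch, transitions with larger TD errors are replayed more often, and `beta` controls how closely the softmax value tracks the largest Q-value; how ITD3 sets or schedules such parameters would have to be taken from the full paper.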