| Cite this article: | PANG Zhou-qi, HAO Cheng-peng, LIN Xiao-bo, PAN Guang-shuai. High speed target acquisition path planning for underwater unmanned vehicles based on deep reinforcement learning[J]. Control Theory & Applications, 2025, 42(10): 1968-1980. |
|
| High speed target acquisition path planning for underwater unmanned vehicles based on deep reinforcement learning |
| Received: 2024-05-13  Revised: 2025-08-02 |
| DOI 10.7641/CTA.2019.90328 |
| 2025,42(10):1968-1980 |
| Keywords deep reinforcement learning; deterministic policy gradient; high speed target acquisition; underwater unmanned vehicle; Markov decision process |
| Funding Supported by the National Natural Science Foundation of China (61971412) and a laboratory fund of the Chinese Academy of Sciences (CXJJ–22S025). |
|
| Abstract |
| High-speed underwater target acquisition poses two main challenges. First, the changeable underwater
environment makes sonar detection data delayed and uncertain. Second, because the target is fast, the intercepting
vehicle cannot capture it by chasing from behind, which greatly reduces the number of interceptable trajectories.
To address these challenges, this paper proposes an improved twin delayed deep deterministic policy gradient (ITD3)
algorithm that improves the interceptor's acquisition efficiency and accuracy. First, based on the dynamics of the
intercepting vehicle, a “planner-controller” cascaded simulation method is constructed, which is more accurate than
pure kinematic simulation and closer to practice than an integrated guidance and control (IGC) model. Second, to
handle the large action space and the delay of underwater sensors, an action mask mechanism and delay-based
exploration noise are introduced. Third, to fit the reward function to the characteristics of the high-speed target
acquisition task, a new reward function is designed that penalizes states unfavorable to capture. Finally, to improve
the convergence speed and stability of the algorithm, prioritized experience replay and the softmax operator are
combined with the TD3 algorithm. Simulation experiments and hardware-in-the-loop simulations show that, compared
with traditional acquisition algorithms, the proposed ITD3 algorithm achieves a shorter interception time and a lower
miss rate, and is highly feasible. |
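The abstract names prioritized experience replay and the softmax operator as the two additions to TD3 but gives no formulas. As a rough illustration only, the sketch below uses the commonly cited generic forms: proportional prioritization with importance-sampling weights, and the Boltzmann softmax operator over a set of candidate Q-values. All function names, the parameters `alpha`, `beta` and `eps`, and their default values are assumptions made for this example and are not taken from the paper.

```python
import math
import random


def softmax_operator(q_values, beta=1.0):
    """Boltzmann softmax-weighted average of candidate Q-values.

    beta = 0 gives the plain mean; a large beta approaches the max.
    (Generic form; the paper's exact operator is not given in the abstract.)
    """
    q_max = max(q_values)  # shift by the max for numerical stability
    weights = [math.exp(beta * (q - q_max)) for q in q_values]
    total = sum(weights)
    return sum(w * q for w, q in zip(weights, q_values)) / total


def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Proportional prioritized-replay sampling probabilities.

    Priority is (|TD error| + eps) ** alpha; alpha = 0 is uniform sampling.
    """
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def importance_weights(probs, beta=0.4):
    """Importance-sampling weights, scaled so the largest weight is 1."""
    n = len(probs)
    raw = [(n * p) ** (-beta) for p in probs]
    w_max = max(raw)
    return [w / w_max for w in raw]


def sample_indices(probs, batch_size, rng=random):
    """Draw transition indices (with replacement) according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)
```

In a TD3-style update, the sampled indices would select the minibatch, the importance weights would scale each transition's critic loss, and the softmax operator would replace the single next-state value in the target. How ITD3 combines these pieces is described only in the full text.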