Citation: WU Pei-liang, YUAN Xu-dong, MAO Bing-yi, CHEN Wen-bai, GAO Guo-wei. Multi-agent deep reinforcement learning via double attention and adaptive entropy [J]. Control Theory & Applications, 2024, 41(10): 1930-1936.
Multi-agent deep reinforcement learning via double attention and adaptive entropy
Received: 2022-11-20    Revised: 2023-06-04
DOI: 10.7641/CTA.2023.21023
2024, 41(10): 1930-1936
Keywords  multi-agent systems; reinforcement learning; attention; adaptive entropy; actor-critic
Foundation items: Supported by the National Key Research and Development Program of China (2018YFB1308300), the National Natural Science Foundation of China (62276028, U20A20167), the Beijing Natural Science Foundation (4202026), the Natural Science Foundation of Hebei Province (F202103079), and the Hebei Province Innovation Capability Improvement Program (22567626H).
Author  Affiliation  E-mail
WU Pei-liang*  Yanshan University  peiliangwu@ysu.edu.cn
YUAN Xu-dong  Yanshan University
MAO Bing-yi  Yanshan University
CHEN Wen-bai  Beijing Information Science and Technology University
GAO Guo-wei  Beijing Information Science and Technology University
Abstract
      The actor-critic framework suffers from overestimation of the value function, and maximum-entropy reinforcement learning from the fragility of its temperature parameter; both problems can trap the policy network in a local optimum. To address this, this paper proposes a multi-agent reinforcement learning algorithm based on a double centralized attention mechanism and adaptive temperature parameters. First, two attention-based critic networks with different initial parameters are constructed; evaluating the policy network with both critics yields more accurate value estimates and avoids the overestimation that drives the policy into a local optimum. Second, a maximum-entropy reinforcement learning algorithm with adaptive temperature parameters is proposed: the policy entropy and a baseline entropy are computed for each agent, and the temperature parameter is adjusted dynamically so that each agent's exploration adapts over training. Finally, the effectiveness of the proposed algorithm is verified in a constrained cooperative navigation environment and a constrained treasure collection environment, where its average total cost and average total penalty outperform those of the compared algorithms.
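The two mechanisms the abstract describes, taking the minimum of two critics to curb overestimation and nudging the temperature toward a per-agent baseline entropy, match the familiar clipped double-Q target and SAC-style automatic entropy adjustment. The sketch below illustrates those standard forms only; the paper's exact update rules are not given on this page, so the function names, signatures, and learning rates are illustrative assumptions.

```python
def double_critic_target(r, q1_next, q2_next, log_pi_next, alpha, gamma=0.99):
    """Soft Bellman target using the minimum of two critic estimates
    (clipped double-Q) to reduce value overestimation."""
    soft_value = min(q1_next, q2_next) - alpha * log_pi_next
    return r + gamma * soft_value

def adapt_temperature(alpha, policy_entropy, baseline_entropy, lr=1e-3):
    """SAC-style adaptive temperature for one agent: increase alpha when the
    policy's entropy falls below its baseline (encouraging exploration),
    decrease it otherwise; clamp so alpha stays positive."""
    return max(alpha + lr * (baseline_entropy - policy_entropy), 1e-6)
```

For example, with reward 1.0, next-state critic values 2.0 and 3.0, a deterministic-looking action (`log_pi_next = 0`), and `alpha = 0.2`, the target uses the smaller critic value, 2.0; and an agent whose policy entropy (0.5) sits below its baseline (1.0) gets a larger temperature on the next step.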