Cite this article: LIU Jian, ZHAO Heng-yi. Enhance exploration with self-generated expert samples[J]. Control Theory & Applications, 2023, 40(3): 485-492.
Enhance exploration with self-generated expert samples
Received: 2021-06-27  Revised: 2022-08-13
DOI: 10.7641/CTA.2021.10552
2023, 40(3): 485-492
Keywords: deep reinforcement learning; exploration; expert sample; deterministic policy
Funding: Supported by the National Natural Science Foundation of China (61906198) and the Natural Science Foundation of Jiangsu Province (BK20190622).
Authors and affiliations:
LIU Jian* (China University of Mining and Technology), liujiansqjxt@126.com
ZHAO Heng-yi (China University of Mining and Technology)
Abstract
      To further improve the exploration ability of deep reinforcement learning algorithms in continuous action environments, and thereby obtain higher rewards, an algorithm named enhance exploration with self-generated expert samples is proposed. First, to support the self-generated expert sample mechanism and learning in continuous action environments, two experience replay pools are set up on the basis of the twin delayed deep deterministic policy gradient (TD3) algorithm, forming the overall framework of the deterministic policy algorithm. Meanwhile, a combined policy update method is proposed: an approximately on-policy learning process is added to the inner loop of each episode, during which the agent performs heuristic exploration of the parameter space. Second, a demonstration mechanism based on self-generated expert samples is proposed: expert samples are selected by the agent itself, and the selection criteria are continuously adjusted as the parameters are updated, forming a dynamic screening criterion; the agent then imitates these expert samples during learning. Simulation experiments in 8 OpenAI Gym environments show that the proposed algorithm effectively improves the exploration ability of deep reinforcement learning.
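The two-replay-pool, dynamic-screening mechanism described in the abstract can be sketched as follows. This is a minimal illustrative reading, not the paper's implementation: the class name `DualReplayBuffer`, the percentile-based return threshold, and the fixed expert sampling ratio are all assumptions made here for illustration.

```python
import random
from collections import deque

class DualReplayBuffer:
    """Two replay pools: a regular pool for all transitions and an
    'expert' pool holding transitions from self-generated expert
    episodes, i.e. episodes whose return beats a dynamic threshold."""

    def __init__(self, capacity=100_000, keep_fraction=0.1):
        self.regular = deque(maxlen=capacity)
        self.expert = deque(maxlen=capacity)
        self.returns = []              # history of episode returns seen so far
        self.keep = keep_fraction      # top fraction of episodes treated as expert

    def add_episode(self, transitions, episode_return):
        self.regular.extend(transitions)
        self.returns.append(episode_return)
        # Dynamic screening criterion: the threshold tracks the top
        # `keep` fraction of returns observed so far, so it rises
        # automatically as the agent's policy improves.
        k = max(1, int(len(self.returns) * self.keep))
        threshold = sorted(self.returns, reverse=True)[k - 1]
        if episode_return >= threshold:
            self.expert.extend(transitions)

    def sample(self, batch_size, expert_ratio=0.25):
        # Mix expert and regular transitions so the agent can imitate
        # its own best behavior while still learning from all data.
        n_expert = min(int(batch_size * expert_ratio), len(self.expert))
        batch = random.sample(list(self.expert), n_expert)
        batch += random.sample(list(self.regular), batch_size - n_expert)
        return batch
```

In this sketch, only episodes that rank in the top fraction of all returns seen so far are copied into the expert pool, so the screening standard tightens as training progresses; a TD3-style learner would then draw mixed batches via `sample` for its critic and actor updates.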