Citation: ZHU Xiao-qing, LIU Xin-yuan, RUAN Xiao-gang, ZHANG Si-yuan, LI Chun-yang, LI Peng. A quadruped robot kinematic skill learning method integrating meta-learning and PPO algorithms[J]. Control Theory and Technology, 2024, 41(1): 155-162.
A quadruped robot kinematic skill learning method integrating meta-learning and PPO algorithms
Received: 2022-09-27  Revised: 2023-04-07
DOI: 10.7641/CTA.2023.20847
2024, 41(1): 155-162
Keywords: quadruped robot; gait learning; reinforcement learning; meta-learning
Funding: Supported by the National Natural Science Foundation of China (62103009) and the Beijing Natural Science Foundation (4202005).
Author  Affiliation  E-mail
ZHU Xiao-qing*  Beijing University of Technology  alex.zhuxq@bjut.edu.cn
LIU Xin-yuan  Beijing University of Technology
RUAN Xiao-gang  Beijing University of Technology
ZHANG Si-yuan  Beijing University of Technology
LI Chun-yang  Beijing University of Technology
LI Peng  Beijing University of Technology
Abstract
      Learning ability is a typical characteristic of higher animal intelligence. To explore the learning mechanism underlying quadruped motor skills, this paper studies the gait learning task of quadruped robots and reproduces the rhythmic gait learning process of quadruped animals from scratch. In recent years, the proximal policy optimization (PPO) algorithm, a representative deep reinforcement learning method, has been widely used in gait learning tasks for quadruped robots, achieving good experimental results while requiring few hyperparameters. However, in scenarios with multidimensional inputs and outputs it tends to converge to a local optimum: in the experimental environment of this study, the gait rhythm signals of the trained quadruped robot were irregular and its center of gravity oscillated severely. To address this problem, inspired by meta-learning and its advantage in characterizing high-dimensional abstract representations of the learning process, this paper proposes a meta proximal policy optimization (MPPO) algorithm that combines meta-learning with PPO, enabling quadruped robots to learn better gaits. Simulation results on the PyBullet platform show that the proposed algorithm enables a quadruped robot to learn walking skills, and comparative experiments against the soft actor-critic (SAC) and PPO algorithms show that the proposed MPPO algorithm yields more regular gait rhythm signals and faster walking, among other advantages.
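For context, the clipped surrogate objective at the core of PPO (which MPPO builds upon) can be sketched as below. This is a generic NumPy illustration of the standard PPO loss, not the authors' implementation; the clipping range `eps = 0.2` is an assumed common default, not a value taken from the paper.

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Clipped surrogate loss of PPO (to be minimized).

    ratio:     pi_new(a|s) / pi_old(a|s), per sample
    advantage: estimated advantage, per sample
    eps:       clipping range (assumed default 0.2)
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Pessimistic (lower) bound: element-wise minimum, negated so that
    # gradient descent on this loss maximizes the surrogate objective.
    return -np.minimum(unclipped, clipped).mean()

# A ratio far above 1 + eps is clipped, which limits the policy update step.
loss = ppo_clip_loss(np.array([1.0, 2.0]), np.array([1.0, 1.0]))
```

The clipping is what keeps each policy update close to the previous policy; the meta-learning component of MPPO operates on top of this inner-loop objective.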