Citation: LIU Zhi-bin, ZENG Xiao-qin, XU Yan, YU Ji-guo. Learning to control by neural networks using eligibility traces [J]. Control Theory & Applications, 2015, 32(7): 887-894.
|
Learning to control by neural networks using eligibility traces
Received: 2014-04-27; Revised: 2015-04-10
DOI: 10.7641/CTA.2015.40367
2015, 32(7): 887-894
Keywords: reinforcement learning; neural networks; eligibility traces; cart-pole system; gradient descent
Funding: Supported by the National Natural Science Foundation of China (61403205, 61373027, 60117089) and the Laboratory Open Fund of Qufu Normal University (sk201415).
|
Abstract
Reinforcement learning is an important approach to adaptive learning control in continuous state spaces, but it suffers from low learning efficiency and slow convergence. Building on back-propagation (BP) neural networks and eligibility traces, we propose a learning algorithm, with a complete description, that performs multi-step updates during reinforcement learning. The algorithm propagates the local gradient from the output-layer nodes back to the hidden-layer nodes, so that the hidden-layer weights can be adjusted rapidly. During network training, a modified residual method linearly weights the direct-gradient and residual-gradient updates of each layer, combining the fast learning rate of the direct gradient method with the convergence properties of the residual gradient method; applying this weighting to the hidden-layer updates improves the convergence of the value function. The algorithms are tested on a cart-pole balancing system. Simulation results show that, after a short period of learning, the method successfully controls the cart-pole system and improves learning efficiency significantly.
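For reference, the multi-step update that eligibility traces provide is conventionally written as the TD(λ) rule below; this is the standard textbook formulation (Sutton and Barto), not necessarily the paper's exact notation, with w the network weights, λ the trace-decay parameter, and δ_t the temporal-difference error:

  δ_t = r_{t+1} + γ·V(s_{t+1}) − V(s_t),
  e_t = γλ·e_{t−1} + ∇_w V(s_t),
  w ← w + α·δ_t·e_t.

Because e_t accumulates gradients from earlier states, a single TD error adjusts the weights along the whole recent trajectory, which is the multi-step effect the abstract refers to.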
|
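The modified residual method echoes Baird's residual algorithms, which interpolate between the direct-gradient and residual-gradient updates. A plausible reading, with a mixing weight φ ∈ [0, 1] (our notation, not taken from the paper), is:

  Δw_direct = α·δ_t·∇_w V(s_t),
  Δw_residual = α·δ_t·(∇_w V(s_t) − γ·∇_w V(s_{t+1})),
  Δw = (1 − φ)·Δw_direct + φ·Δw_residual = α·δ_t·(∇_w V(s_t) − φγ·∇_w V(s_{t+1})).

Setting φ = 0 recovers the fast direct gradient method, while φ = 1 gives the pure residual gradient method with its convergence guarantees; intermediate values of φ trade one off against the other.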
|
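Putting the two together, a minimal NumPy sketch of the blended, trace-based weight update for a one-hidden-layer value network might look as follows; all names, network sizes, and hyperparameters here are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

# Minimal sketch (not the authors' code): TD(lambda) with eligibility traces
# on a one-hidden-layer BP network value function, blending the direct and
# residual gradients Baird-style. ALPHA, GAMMA, LAM, PHI are assumed values.
rng = np.random.default_rng(0)

ALPHA, GAMMA, LAM, PHI = 0.05, 0.95, 0.8, 0.3   # assumed hyperparameters

# V(s) = w2 . tanh(W1 s), with a 4-dimensional cart-pole-style state
W1 = rng.normal(scale=0.1, size=(8, 4))
w2 = rng.normal(scale=0.1, size=8)

def value_and_grads(s):
    """Return V(s) and the gradients of V with respect to W1 and w2."""
    h = np.tanh(W1 @ s)
    v = float(w2 @ h)
    g_w2 = h                                    # dV/dw2
    g_W1 = np.outer(w2 * (1.0 - h**2), s)       # dV/dW1, backprop through tanh
    return v, g_W1, g_w2

e_W1 = np.zeros_like(W1)                        # eligibility traces,
e_w2 = np.zeros_like(w2)                        # one per weight

def td_step(s, r, s_next, done):
    """One TD(lambda) update with blended direct/residual gradients."""
    global W1, w2
    v, g_W1, g_w2 = value_and_grads(s)
    v_next, gn_W1, gn_w2 = value_and_grads(s_next)
    if done:                                    # terminal state has V = 0
        v_next, gn_W1, gn_w2 = 0.0, np.zeros_like(gn_W1), np.zeros_like(gn_w2)
    delta = r + GAMMA * v_next - v              # TD error

    # blended gradient direction: grad V(s) - PHI * gamma * grad V(s')
    e_W1[:] = GAMMA * LAM * e_W1 + (g_W1 - PHI * GAMMA * gn_W1)
    e_w2[:] = GAMMA * LAM * e_w2 + (g_w2 - PHI * GAMMA * gn_w2)

    W1 += ALPHA * delta * e_W1                  # multi-step credit assignment
    w2 += ALPHA * delta * e_w2

# toy usage: one update on a random transition
s, s2 = rng.normal(size=4), rng.normal(size=4)
td_step(s, r=1.0, s_next=s2, done=False)
```

In a cart-pole experiment of the kind the abstract describes, td_step would be called once per control step, with the traces reset to zero at the start of each episode.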