Cite this article: LIANG Rong-qin, ZHU Yuan-heng, ZHAO Dong-bin. Deep reinforcement learning for two-player fighting game based on opponent pool[J]. Control Theory & Applications, 2025, 42(2): 226-234.
Deep reinforcement learning for two-player fighting game based on opponent pool
Received: 2023-10-19    Revised: 2024-10-26
DOI: 10.7641/CTA.2024.30688
2025, 42(2): 226-234
Keywords: real-time fighting game; deep reinforcement learning; two-player zero-sum game; opponent policy pool
Funding: Supported by the Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2018AAA0102404), the Strategic Priority Research Program of the Chinese Academy of Sciences (XDA27030400), the National Natural Science Foundation of China (62293541, 62136008), and the Youth Innovation Promotion Association of the Chinese Academy of Sciences (2021132).
Author    Affiliation    Postcode
LIANG Rong-qin    Institute of Automation, Chinese Academy of Sciences    100190
ZHU Yuan-heng*    Institute of Automation, Chinese Academy of Sciences    100190
ZHAO Dong-bin    Institute of Automation, Chinese Academy of Sciences
Abstract
      In the realm of game artificial intelligence, two-player games represent a fundamental and crucial problem, with one-on-one zero-sum fighting games standing as one of the most typical two-player games. In this paper, we explore adversarial strategies for fighting games based on deep reinforcement learning. We begin by modeling the fighting game environment, formulating the states, actions, and reward functions applicable to decision-making within these games, and we employ the phasic policy gradient algorithm to learn adversarial strategies. To learn a strategy as close to the Nash equilibrium as possible, and thereby defeat arbitrary opponents, we construct an opponent pool from agents that entered previous competitions and use it for training, and we investigate the impact of the opponent selection mechanism on the training process. Finally, building on the fixed opponent pool, we devise a self-expanding opponent pool algorithm to enhance the completeness of the opponent strategies and the robustness of the trained agent. To accelerate environment sampling, we extend conventional parallel architectures into a distributed, multi-server parallel sampling framework for two-player games. Experimental comparisons show that the agent trained with the self-expanding opponent pool achieves a 96.6% win rate against the agents in the fixed opponent pool, and a 72.2% win rate against three agents used solely for testing.
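
The abstract only names the opponent-pool components. The Python sketch below is a minimal illustration of how such a pool could work, assuming a win-rate-weighted selection mechanism and a growth rule that freezes a copy of the learner once it dominates the pool; all identifiers (OpponentPool, select, maybe_grow, the 0.9 threshold) are hypothetical and are not taken from the paper.

import copy
import random

class OpponentPool:
    """A minimal sketch of a self-expanding opponent pool (illustrative only)."""

    def __init__(self, initial_opponents, growth_threshold=0.9):
        self.opponents = list(initial_opponents)       # e.g. bots from past competitions
        self.win_rates = [0.5] * len(self.opponents)   # learner's estimated win rate vs. each opponent
        self.growth_threshold = growth_threshold       # grow the pool once the mean win rate exceeds this

    def select(self):
        # Opponent selection mechanism: prefer opponents the learner still loses to,
        # weighting each opponent by (1 - win_rate).
        weights = [max(1.0 - w, 1e-3) for w in self.win_rates]
        idx = random.choices(range(len(self.opponents)), weights=weights, k=1)[0]
        return idx, self.opponents[idx]

    def update(self, idx, learner_won, step=0.05):
        # Exponential moving average of the win rate against opponent idx.
        self.win_rates[idx] += step * (float(learner_won) - self.win_rates[idx])

    def maybe_grow(self, learner_policy):
        # Self-expanding rule: freeze a snapshot of the learner as a new opponent
        # once it reliably beats the current pool.
        if sum(self.win_rates) / len(self.win_rates) >= self.growth_threshold:
            self.opponents.append(copy.deepcopy(learner_policy))
            self.win_rates.append(0.5)

In a training loop one would call select() before each episode, update() with the episode outcome, and maybe_grow() at regular intervals; the paper's actual selection rule and growth criterion may differ.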
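
The distributed, multi-server parallel sampling framework is likewise only mentioned at a high level. As a rough single-machine analogue (the environment and transition format below are placeholders, not the paper's implementation), rollout workers can generate episodes against pool opponents and feed a single learner through a shared queue:

import multiprocessing as mp
import random

def rollout_worker(worker_id, sample_queue, episodes):
    # Placeholder for playing the fighting game against a sampled pool opponent:
    # each "episode" is a list of dummy (state_feature, action, reward) tuples.
    for _ in range(episodes):
        episode = [(random.random(), random.randint(0, 39), random.random())
                   for _ in range(60)]
        sample_queue.put((worker_id, episode))

def learner(sample_queue, total_episodes):
    # A real learner would batch these transitions and run policy-gradient updates.
    received = 0
    while received < total_episodes:
        _worker_id, _episode = sample_queue.get()
        received += 1
    print(f"collected {received} episodes")

if __name__ == "__main__":
    queue = mp.Queue()
    workers = [mp.Process(target=rollout_worker, args=(i, queue, 5)) for i in range(4)]
    for w in workers:
        w.start()
    learner(queue, total_episodes=4 * 5)
    for w in workers:
        w.join()

Replacing the in-process queue with a network transport between machines is what would turn this pattern into the multi-server variant described in the abstract.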