Decision making for two-player zero-sum Markov games with indistinguishable opponents

WANG Cheng-yi; ZHU Jin; ZHAO Yun-bo

Cite this article: WANG Cheng-yi, ZHU Jin, ZHAO Yun-bo. Decision making for two-player zero-sum Markov games with indistinguishable opponents[J]. Control Theory & Applications, 2024, 41(11): 2131-2138.


Received: 2022-07-15; Revised: 2024-04-12


DOI: 10.7641/CTA.2023.20630

2024,41(11):2131-2138

Keywords: two-player zero-sum Markov game; incomplete information; minimax Q-learning; Nash equilibrium; multi-agent reinforcement learning

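Funding: Supported by the National Key Research and Development Program of China (2018AAA0100802) and the Natural Science Foundation of Anhui Province (2008085MF198).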

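Author	Affiliation	E-mail
WANG Cheng-yi	University of Science and Technology of China	wangchengyi@mail.ustc.edu.cn
ZHU Jin*	University of Science and Technology of China	jinzhu@ustc.edu.cn
ZHAO Yun-bo	University of Science and Technology of China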

Abstract

This paper studies a typical class of incomplete-information games: two-player zero-sum Markov games with indistinguishable opponents, in which the opponent may be one of several types and its type is not known before each game begins. We propose a model-based multi-agent reinforcement learning algorithm, distinguishing opponent minimax Q-learning (DOMQ). The algorithm first builds an empirical model of the environment associated with each opponent type and then uses that model to learn the corresponding Nash equilibrium strategy. During actual play, the agent infers the opponent type from the empirical models and applies the matching Nash equilibrium strategy, which guarantees a lower bound on its return. DOMQ needs only the opponent type revealed at the end of each episode during the sampling phase; it requires no other information about the environment. Simulation results verify the effectiveness of the proposed algorithm.
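The abstract does not spell out how the empirical models are used to recognise the opponent, so the following is only a minimal sketch of one plausible reading: count transitions per opponent type from labelled sampling episodes, then pick the type whose model gives the observed transitions the highest likelihood. The class `EmpiricalModel`, the functions `fit_models` and `identify_opponent`, the Laplace smoothing, the maximum-likelihood rule, and the toy data are all assumptions made for illustration, not the DOMQ procedure from the paper.

```python
import math
from collections import defaultdict


class EmpiricalModel:
    """Transition counts for one opponent type (hypothetical helper)."""

    def __init__(self, n_states, smoothing=1.0):
        self.n_states = n_states
        self.smoothing = smoothing  # Laplace smoothing: an assumption of this sketch
        self.counts = defaultdict(lambda: defaultdict(float))

    def update(self, state, action, next_state):
        # Record one observed transition (state, own action) -> next_state.
        self.counts[(state, action)][next_state] += 1.0

    def prob(self, state, action, next_state):
        # Smoothed estimate of P(next_state | state, action) for this opponent type.
        row = self.counts.get((state, action), {})
        total = sum(row.values())
        return (row.get(next_state, 0.0) + self.smoothing) / (
            total + self.smoothing * self.n_states
        )


def fit_models(labelled_episodes, n_states):
    """Build one empirical model per opponent type.

    Each episode is (opponent_type, [(state, action, next_state), ...]);
    the type label is the one revealed after each sampling episode.
    """
    models = {}
    for opponent_type, transitions in labelled_episodes:
        model = models.setdefault(opponent_type, EmpiricalModel(n_states))
        for s, a, s_next in transitions:
            model.update(s, a, s_next)
    return models


def identify_opponent(models, observed):
    """Return the type with the highest log-likelihood, plus all scores."""
    scores = {
        t: sum(math.log(m.prob(s, a, s_next)) for s, a, s_next in observed)
        for t, m in models.items()
    }
    return max(scores, key=scores.get), scores


# Toy data (invented): against type "A" the state usually moves 0 -> 1,
# against type "B" it usually stays at 0.
episodes = [
    ("A", [(0, 0, 1), (0, 0, 1), (0, 0, 1), (0, 0, 0)]),
    ("B", [(0, 0, 0), (0, 0, 0), (0, 0, 0), (0, 0, 1)]),
]
models = fit_models(episodes, n_states=2)
guess, scores = identify_opponent(models, [(0, 0, 1), (0, 0, 1)])
print(guess, scores)
```

In a full pipeline the returned type would select a precomputed Nash equilibrium strategy for that opponent; learning those strategies (for example with minimax Q-learning on each empirical model) is outside this sketch.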