Research Progress of Opponent Modeling Based on Deep Reinforcement Learning

doi:10.16182/j.issn1004731x.joss.22-0555

Abstract

Deep reinforcement learning is an agent modeling approach that combines the feature-extraction ability of deep learning with the sequential decision-making ability of reinforcement learning. It can compensate for the shortcomings of traditional opponent modeling methods, including poor adaptation to non-stationarity, complex feature selection, and insufficient state-space representation ability. Deep reinforcement learning-based opponent modeling methods are divided into two categories, explicit modeling and implicit modeling, and the corresponding theories, models, algorithms, and applicable scenarios are reviewed by category. Applications of deep reinforcement learning-based opponent modeling techniques in different fields are introduced. The key problems that urgently need to be solved and the directions for future development are summarized, providing a comprehensive review of deep reinforcement learning-based opponent modeling methods.

Key words: deep reinforcement learning, opponent modeling, game theory, theory of mind, representation learning, meta learning

CLC number: TP391.9

Haotian Xu, Long Qin, Junjie Zeng, Yue Hu, Qi Zhang. Research Progress of Opponent Modeling Based on Deep Reinforcement Learning[J]. Journal of System Simulation, 2023, 35(4): 671-694.

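Table 2  Research motivation, problems addressed, and effects of game-equilibrium strategy methods

Category	Method	Research motivation	Opponent model	Model effect
Fictitious self-play	FSP^[38]	Extends fictitious play (FP) to extensive-form games	Historical average of the opponent's best responses	Reinforcement learning computes the best response and supervised learning computes the average strategy; converges to a Nash equilibrium
Fictitious self-play	NFSP^[39]	Uses neural networks to approximate the best-response and average strategies	Opponent's historical average best response, approximated by a multilayer neural network	End-to-end learning based on DQN; converges to a Nash equilibrium
Fictitious self-play	PSRO^[42]	Solves for meta-strategies of sub-games and merges them into a complete strategy	Records the opponent's historical strategies in a meta-strategy set	Trains new strategies with the double oracle (DO) algorithm^[58]; convergence is affected by how opponent strategies are sampled
Fictitious self-play	$\alpha$-PSRO^[44]	Trains every strategy so as to improve the population, rather than only training a Nash equilibrium strategy	Evaluates the quality of the opponent population with Markov-Conley chains	Strategies converge to the $\alpha$-rank solution^[43], improving equilibrium convergence in population games
Counterfactual regret minimization	MCCFR^[48]	Replaces full traversal of tree nodes with Monte Carlo sampling when computing the regret of each state	Information sets containing all possible opponent actions	Monte Carlo sampling gives an unbiased estimate of the regret and converges quickly in imperfect-information extensive-form games
Counterfactual regret minimization	CFR+^[51]	Uses a regret-matching method that keeps action regrets non-negative so that the accumulated value does not decrease	Information sets containing all possible opponent actions	Improves the regret-matching mechanism, accelerating the convergence of CFR to an approximate Nash equilibrium
MiniMax equilibrium	Level-0^[55]	Actions of boundedly rational opponents arise from recursive reasoning over a level-0 strategy, which is chosen from hand-selected strategies	Quantitative cognitive-hierarchy strategy with the MiniMax strategy as level 0	The level-0 strategy improves the cognitive hierarchy model; experiments on datasets show that it predicts human behavior effectively
MiniMax equilibrium	M3DDPG^[56]	Multi-agent DRL algorithm that uses a worst-case strategy to respond robustly to changing opponents	The opponent strategy that minimizes the agent's own payoff	Uses adversarial learning to solve for MiniMax equilibrium strategies in continuous dynamic environments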
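The FSP row above builds on fictitious play, in which each player best-responds to the opponent's historical average behavior. The following is a minimal sketch of plain fictitious play on a two-player zero-sum matrix game, not the FSP or NFSP algorithm itself. The function name `fictitious_play`, the iteration count, and the matching-pennies payoff matrix are illustrative assumptions.

```python
def fictitious_play(payoff, iterations=10000):
    """Plain fictitious play on a zero-sum matrix game (illustrative sketch).

    payoff[i][j] is the row player's payoff; the column player receives
    the negative of it. Returns the empirical action frequencies.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_counts = [0] * n_rows  # how often the row player chose each action
    col_counts = [0] * n_cols  # how often the column player chose each action

    for _ in range(iterations):
        # Row player's value of each action against the column player's history.
        row_values = [
            sum(payoff[i][j] * col_counts[j] for j in range(n_cols))
            for i in range(n_rows)
        ]
        # Row player's payoff for each column action against the row history.
        col_values = [
            sum(payoff[i][j] * row_counts[i] for i in range(n_rows))
            for j in range(n_cols)
        ]
        # Both players best-respond simultaneously (ties go to the first action).
        best_row = max(range(n_rows), key=lambda i: row_values[i])
        best_col = min(range(n_cols), key=lambda j: col_values[j])
        row_counts[best_row] += 1
        col_counts[best_col] += 1

    row_freq = [c / iterations for c in row_counts]
    col_freq = [c / iterations for c in col_counts]
    return row_freq, col_freq


# Matching pennies: the unique Nash equilibrium mixes both actions equally.
MATCHING_PENNIES = [[1, -1], [-1, 1]]
row_freq, col_freq = fictitious_play(MATCHING_PENNIES)
```

In this game the empirical frequencies of both players approach the uniform mixed strategy, which is the sense in which the historical average of best responses serves as the opponent model.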

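Table 4  Summary of research motivations, innovations, and limitations of implicit DRL opponent modeling methods

Category	Algorithm	Research motivation	Model characteristics	Innovation	Limitations
Auxiliary task	DRON^[26]	Designs a neural network that mines the hidden features of different opponent strategies	Uses an MLP to process opponent actions and feeds the resulting representation to the reinforcement learning task	Extracts opponent features for DRL decision-making	Features fed to the expert networks are extracted by hand; could be improved with an RNN
Auxiliary task	DPIQN^[71]	Extracts opponent policy features directly from observations and trains opponent modeling as an auxiliary task	A policy-feature network learns to extract representations from observations and is corrected according to behavior-cloning accuracy	Designs an adaptive loss function that balances reward maximization against opponent modeling	Trained offline from an experience replay buffer; the learned opponent policy has high sample variance
Auxiliary task	AMS-A3C^[72]	Formulates an auxiliary task that estimates other agents' policies during reinforcement learning	The decision network and the opponent model that imitates decisions share structure and parameters, lowering the cost of model learning	Proposes two schemes, parameter sharing and policy representation, that integrate opponent modeling into A3C	The opponent model is parameter-sensitive and struggles with complex scenarios and with opponents that learn
Representation learning	PPO-Emb^[74]	Learns opponent representations from interaction samples without supervision	Extracts representations that both improve the policy and discriminate among opponents	Requires no domain knowledge; general and applicable to most DRL algorithms	Cannot perform inference on its own; serves to assist the decisions of other DRL algorithms
Representation learning	RFM^[73]	Uses graph networks to learn representations of the social relations among agents	Predicts opponent actions and assesses the strength of opponent social relations from graph-structure information such as edge attributes and nodes	Quantifies the social properties of agent interaction; the network structure extends well	Graph networks are hard to compute when interaction relations are complex
Probabilistic inference	P-BIT^[76]	Formalizes the optimal policy of multi-agent DRL as a probabilistic lower bound on inferring private information	Uses a belief module to infer a teammate's private information from its behavior	Proposes a method for conveying private information to teammates through actions under imperfect information	Applicable to simple two-player cooperative scenarios
Probabilistic inference	ROMMEO^[78]	Formalizes multi-agent DRL as variational inference of the optimal policy based on an opponent model	Predicts opponent actions, which are used in the inference task of learning the optimal policy	Proposes a regularized opponent modeling method with a maximum-entropy objective	Parameters are optimized online, so training takes a long time; assumes the opponent's goal is known and cannot adapt to unknown agents
Self-other interaction	SOM^[79]	Infers the opponent's possible goals from the agent's own policy to support decision-making	Builds a neural network that fits the opponent's policy and infers the opponent's goal backward by optimizing that policy	Needs no additional model or parameters for explicit modeling; reasons about any number of opponents by reusing its own model	Assumes the agent and the opponent share goals and that the reward structure depends on the goal
Self-other interaction	LOLA^[81]	Considers opponents that learn, accounting for how updates of the opponent's learned parameters affect the agent's own policy	Models the opponent's value function and uses its second-order derivative to optimize the policy gradient	Adds an opponent-parameter-update term to the policy update, constructed as a higher-order gradient term via Taylor expansion	Assumes the opponent uses a gradient-based method and is unaware that LOLA exploits its model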
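The auxiliary-task rows above share one idea: the agent's training loss combines a reinforcement learning term with a term that rewards correct prediction of the opponent's action. The following is a generic sketch of such a combined loss for a single transition, not the loss of any specific algorithm in the table. The function names and the fixed weight `aux_weight` are hypothetical; the table notes that DPIQN uses an adaptive weighting instead.

```python
import math


def cross_entropy(predicted_probs, observed_action):
    """Negative log-likelihood of the action the opponent actually took."""
    eps = 1e-12  # guards against log(0)
    return -math.log(max(predicted_probs[observed_action], eps))


def combined_loss(td_error, predicted_probs, observed_action, aux_weight=0.5):
    """Squared TD error plus a weighted opponent-prediction loss.

    aux_weight is a hypothetical fixed trade-off coefficient.
    """
    rl_loss = 0.5 * td_error ** 2
    aux_loss = cross_entropy(predicted_probs, observed_action)
    return rl_loss + aux_weight * aux_loss


# Same TD error, but the first opponent model predicts the observed action
# (index 0) with high probability and the second with low probability.
loss_good = combined_loss(td_error=0.2, predicted_probs=[0.9, 0.1], observed_action=0)
loss_bad = combined_loss(td_error=0.2, predicted_probs=[0.1, 0.9], observed_action=0)
```

A poorer opponent prediction raises the total loss even when the TD error is unchanged, which is what pushes shared network features toward encoding the opponent's policy.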

References

1	Rubinstein A. Modeling Bounded Rationality[M]. Cambridge, MA: MIT Press, 1998.
2	Wang H, Kwong S, Jin Y, et al. Agent-based Evolutionary Approach for Interpretable Rule-based Knowledge Extraction[J]. IEEE Transactions on Systems Man and Cybernetics Part C(Applications and Reviews) (S1094-6977), 2005, 35(2): 143-155.
3	Nguyen H V, Rezatofighi H, Vo B N, et al. Multi-Objective Multi-Agent Planning for Jointly Discovering and Tracking Mobile Objects[C]//AAAI Conference on Artificial Intelligence. New York, USA: AAAI, 2020, 34(5): 7227-7235.
4	Sartoretti G, Kerr J, Shi Y, et al. PRIMAL: Pathfinding via Reinforcement and Imitation Multi-Agent Learning[J]. IEEE Robotics & Automation Letters (S2377-3766), 2019, 4(3): 2378-2385.
5	Albrecht S V, Stone P. Autonomous Agents Modelling Other Agents: A Comprehensive Survey and Open Problems[J]. Artificial Intelligence(S0004-3702), 2018, 258: 66-95.
6	Mnih V, Kavukcuoglu K, Silver D, et al. Human-level Control through Deep Reinforcement Learning[J]. Nature(S0028-0836), 2015, 518(7540): 529-533.
7	Zhao E, Yan R, Li J, et al. AlphaHoldem: High-Performance Artificial Intelligence for Heads-Up No-Limit Texas Hold'em from End-to-End Reinforcement Learning[C]//AAAI Conference on Artificial Intelligence. Vancouver, Canada: AAAI, 2022, 36(4): 4689-4697.
8	Zhang Meng, Li Kai, Wu Zhe, et al. An Opponent Modeling and Strategy Integration Framework for Texas Hold'em[J]. Acta Automatica Sinica, 2022, 48(4): 1004-1017. (in Chinese)
9	Vinyals O, Babuschkin I, Czarnecki W M, et al. Grandmaster Level in StarCraft II Using Multi-agent Reinforcement Learning[J]. Nature(S0028-0836), 2019, 575(7782): 350-354.
10	Bakkes S C J, Spronck P H M, Van Den Herik H J. Opponent Modelling for Case-Based Adaptive Game AI[J]. Entertainment Computing(S1875-9521), 2009, 1(1): 27-37.
11	Luo Junren, Zhang Wanpeng, Yuan Weilin, et al. Research on Opponent Modeling Framework for Multi-agent Game Confrontation[J]. Journal of System Simulation, 2022, 34(9): 1941-1955. (in Chinese)
12	Liu Chanjuan, Zhao Tianhao, Liu Ruikang, et al. Research Progress of Opponent Modeling for Agent[J]. Journal of Graphics, 2021, 42(5): 703-711. (in Chinese)
13	Hernandez-Leal P, Kartal B, Taylor M E. A Survey and Critique of Multiagent Deep Reinforcement Learning[J]. Autonomous Agents and Multi-Agent Systems(S1387-2532), 2019, 33(6): 750-797.
14	Tuyls K, Stone P. Multiagent Learning Paradigms[C]//Multi-Agent Systems and Agreement Technologies. 15th European Conference, EUMAS 2017, and 5th International Conference, AT 2017. Evry, France: Springer International Publishing, 2018: 3-21.
15	Nashed S, Zilberstein S. A Survey of Opponent Modeling in Adversarial Domains[J]. Journal of Artificial Intelligence Research(S1076-9757), 2022, 73: 277-327.
16	Powers R, Shoham Y. Learning Against Opponents with Bounded Memory[C]//19th International Joint Conference on Artificial Intelligence. Edinburgh, Scotland: Morgan Kaufmann, 2005: 817-822.
17	Chakraborty D, Stone P. Cooperating with A Markovian Ad Hoc Teammate[C]//2013 International Conference on Autonomous Agents and Multi-agent Systems. Beijing, China: AAMAS, 2013: 1085-1092.
18	De Weerd H, Verbrugge R, Verheij B. Negotiating with other Minds: The Role of Recursive Theory of Mind in Negotiation with Incomplete Information[J]. Autonomous Agents and Multi-Agent Systems(S1387-2532), 2017, 31(2): 250-287.
19	Sonu E, Doshi P. Scalable Solutions of Interactive POMDPs Using Generalized and Bounded Policy Iteration[J]. Autonomous Agents and Multi-Agent Systems(S1387-2532), 2015, 29(3): 455-494.
20	Zeng Y, Doshi P. Exploiting Model Equivalences for Solving Interactive Dynamic Influence Diagrams[J]. Journal of Artificial Intelligence Research(S1076-9757), 2012, 43: 211-255.
21	Doshi P, Zeng Y, Chen Q. Graphical Models for Interactive POMDPs: Representations and Solutions[J]. Autonomous Agents and Multi-Agent Systems(S1387-2532), 2009, 18(3): 376-416.
22	Barrett S, Stone P. Cooperating with Unknown Teammates in Complex Domains: A Robot Soccer Case Study of Ad Hoc Teamwork[C]//Twenty-Ninth AAAI Conference on Artificial Intelligence. Austin, USA: AAAI, 2015: 2010-2016.
23	Erdogan C, Veloso M. Action Selection via Learning Behavior Patterns in Multi-Robot Systems[C]//Twenty-Second International Joint Conference on Artificial Intelligence. Barcelona, Spain: Morgan Kaufmann, 2011: 192-197.
24	Weber B G, Mateas M. A Data Mining Approach to Strategy Prediction[C]//2009 IEEE Symposium on Computational Intelligence and Games. Milan, Italy: IEEE, 2009: 140-147.
25	Schadd F, Bakkes S, Spronck P. Opponent Modeling in Real-Time Strategy Games[C]//The 8th Annual European Game-On Conference on Simulation and AI in Computer Games(GAMEON). Bologna, Italy: Marco Roccetti, 2007: 61-70.
26	He H, Boyd-Graber J, Kwok K, et al. Opponent Modeling in Deep Reinforcement Learning[C]//International Conference on Machine Learning. New York, USA: PMLR, 2016: 1804-1813.
27	Baker C, Saxe R, Tenenbaum J. Bayesian Theory of Mind: Modeling Joint Belief-desire Attribution[C]//Annual Meeting of the Cognitive Science Society. Boston, USA: Cognitive Science Society, 2011: 2469-2474.
28	Fern A, Tadepalli P. A Computational Decision Theory for Interactive Assistants[C]//23rd International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc, 2010: 577-585.
29	Sohrabi S, Riabov A V, Udrea O. Plan Recognition as Planning Revisited[C]//Twenty-Fifth International Joint Conference on Artificial Intelligence. New York, USA: AAAI, 2016: 3258-3264.
30	Albrecht S V, Stone P. Reasoning about Hypothetical Agent Behaviours and Their Parameters[C]//International Conference on Autonomous Agents and Multiagent Systems. Richland, USA: International Foundation for Autonomous Agents and Multiagent Systems, 2017: 547-555.
31	Albrecht S V, Crandall J W, Ramamoorthy S. Belief and Truth in Hypothesised Behaviours[J]. Artificial Intelligence (S0004-3702), 2016, 235: 63-94.
32	Sun Changyin, Mu Chaoxu. Important Scientific Problems of Multi-Agent Deep Reinforcement Learning[J]. Acta Automatica Sinica, 2020, 46(7): 1301-1312. DOI:10.16383/j.aas.c200159. (in Chinese)
33	Hausknecht M, Stone P. Deep Recurrent Q-Learning for Partially Observable MDPs[C]//2015 AAAI Fall Symposium Series. Arlington, Virginia: AAAI, 2015.
34	Lillicrap T P, Hunt J J, Pritzel A, et al. Continuous Control with Deep Reinforcement Learning[J/OL]. ArXiv Preprint ArXiv:1509.02971, 2015. [2022-03-25].
35	Mnih V, Badia A P, Mirza M, et al. Asynchronous Methods for Deep Reinforcement Learning[C]//33rd International Conference on Machine Learning. New York, USA: PMLR, 2016: 1928-1937.
36	Schulman J, Levine S, Moritz P, et al. Trust Region Policy Optimization[C]//32nd International Conference on Machine Learning. Lille, France: PMLR, 2015: 1889-1897.
37	Schulman J, Wolski F, Dhariwal P, et al. Proximal Policy Optimization Algorithms[J/OL]. ArXiv Preprint ArXiv:1707.06347, 2017. [2022-03-25].
38	Heinrich J, Lanctot M, Silver D. Fictitious Self-play in Extensive-form Games[C]//32nd International Conference on Machine Learning. Lille, France: PMLR, 2015: 805-813.
39	Heinrich J, Silver D. Deep Reinforcement Learning from Self-play in Imperfect-Information Games[J/OL]. ArXiv Preprint ArXiv:1603.01121, 2016. [2022-03-25].
40	Kawamura K, Tsuruoka Y. Neural Fictitious Self-play on ELF Mini-RTS[J/OL]. ArXiv Preprint ArXiv:1902.02004, 2019. [2022-03-25].
41	Zhang L, Chen Y, Wang W, et al. A Monte Carlo Neural Fictitious Self-Play Approach to Approximate Nash Equilibrium in Imperfect-information Dynamic Games[J]. Frontiers of Computer Science(S2095-2228), 2021, 15(5): 1-14.
42	Lanctot M, Zambaldi V, Gruslys A, et al. A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning[C]//31st Conference on Neural Information Processing Systems(NIPS). Long Beach, USA: MIT Press, 2017: 4193-4206.
43	Omidshafiei S, Papadimitriou C, Piliouras G, et al. α-rank: Multi-agent Evaluation by Evolution[J]. Scientific Reports(S2045-2322), 2019, 9(1): 1-29.
44	Muller P, Omidshafiei S, Rowland M, et al. A Generalized Training Approach for Multiagent Learning[J/OL]. ArXiv Preprint ArXiv:1909.12823, 2019. [2022-03-25].
45	Balduzzi D, Garnelo M, Bachrach Y, et al. Open-ended Learning in Symmetric Zero-sum Games[C]//International Conference on Machine Learning. Long Beach, USA: PMLR, 2019: 434-443.
46	Zinkevich M, Johanson M, Bowling M, et al. Regret Minimization in Games with Incomplete Information[C]//20th International Conference on Neural Information Processing Systems(NIPS'07). Red Hook, New York, USA: Curran Associates Inc, 2007: 1729-1736.
47	Gibson R. Regret Minimization in Games and the Development of Champion Multiplayer Computer Poker-Playing Agents[D]. Edmonton: University of Alberta, 2014.
48	Johanson M, Bard N, Lanctot M, et al. Efficient Nash Equilibrium Approximation through Monte Carlo Counterfactual Regret Minimization[C]//International Conference on Autonomous Agents and Multiagent Systems. Valencia, Spain: International Foundation for Autonomous Agents and Multiagent Systems, 2012: 837-846.
49	Lanctot M. Monte Carlo Sampling and Regret Minimization for Equilibrium Computation and Decision-Making in Large Extensive Form Games[D]. Edmonton: University of Alberta, 2013.
50	Brown N, Sandholm T. Reduced Space and Faster Convergence in Imperfect-information Games via Pruning[C]//International Conference on Machine Learning. Sydney, Australia: PMLR, 2017: 596-604.
51	Tammelin O. Solving Large Imperfect Information Games Using CFR+[J/OL]. ArXiv Preprint ArXiv:1407.5042, 2014. [2022-03-25].
52	Gilpin A, Sandholm T. Better Automated Abstraction Techniques for Imperfect Information Games, with Application to Texas Hold'em Poker[C]//International Joint Conference on Autonomous Agents and Multiagent Systems. New York, USA: International Foundation for Autonomous Agents and Multiagent Systems, 2007: 1-8.
53	Waugh K, Schnizlein D, Bowling M, et al. Abstraction Pathologies in Extensive Games[C]//8th International Conference on Autonomous Agents and Multiagent Systems. Budapest, Hungary: International Foundation for Autonomous Agents and Multiagent Systems, 2009: 781-788.
54	Wang Pengcheng. Research on Imperfect Information Machine Game Based on Deep Reinforcement Learning[D]. Harbin: Harbin Institute of Technology, 2016. (in Chinese)
55	Wright J R, Leyton-Brown K. Level-0 Models for Predicting Human Behavior in Games[J]. Journal of Artificial Intelligence Research(S1076-9757), 2019, 64: 357-383.
56	Li S, Wu Y, Cui X, et al. Robust Multi-agent Reinforcement Learning via Minimax Deep Deterministic Policy Gradient[C]//AAAI Conference on Artificial Intelligence. Hawaii, USA: AAAI, 2019, 33: 4213-4220.
57	Lowe R, Wu Y, Tamar A, et al. Multi-agent Actor-critic for Mixed Cooperative-competitive Environments[C]//31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc, 2017: 6382-6393.
58	McMahan H B, Gordon G J, Blum A. Planning in the Presence of Cost Functions Controlled by an Adversary[C]//20th International Conference on Machine Learning (ICML-03). Helsinki, Finland: ACM, 2003: 536-543.
59	Frith C, Frith U. Theory of Mind[J]. Current Biology(S0960-9822), 2005, 15(17): R644-R645.
60	Rabinowitz N C, Perbet F, Song H F, et al. Machine Theory of Mind[C]//35th International Conference on Machine Learning. Stockholm, Sweden: PMLR, 2018: 4218-4227.
61	Aucher G, Bolander T. Undecidability in Epistemic Planning[C]//23rd International Joint Conference on Artificial Intelligence. Beijing, China: Morgan Kaufmann, 2013: 27-33.
62	Hartford J, Wright J R, Leyton-Brown K. Deep Learning for Predicting Human Strategic Behavior[C]//30th International Conference on Neural Information Processing Systems. Barcelona, Spain: Curran Associates Inc, 2016: 2432-2440.
63	Wen Y, Yang Y, Luo R, et al. Probabilistic Recursive Reasoning for Multi-agent Reinforcement Learning[J/OL]. ArXiv Preprint ArXiv:1901.09207, 2019. [2022-03-25].
64	Wen Y, Yang Y, Luo R, et al. Modelling Bounded Rationality in Multi-agent Interactions by Generalized Recursive Reasoning[C]//Twenty-Ninth International Joint Conference on Artificial Intelligence(IJCAI-20). Yokohama, Japan: AAAI, 2020: 414-421.
65	Wong A, Bäck T, Kononova A V, et al. Multiagent Deep Reinforcement Learning: Challenges and Directions Towards Human-Like Approaches[J]. Artificial Intelligence Review(S0269-2821), 2022. DOI:10.1007/s10462-022-10299-x. [2022-03-25].
66	Everett R, Roberts S. Learning Against Non-stationary Agents with Opponent Modelling and Deep Reinforcement Learning[C]//2018 AAAI Spring Symposium Series. Palo Alto, California: AAAI, 2018.
67	Hernandez-Leal P, Rosman B, Taylor M E, et al. A Bayesian Approach for Learning and Tracking Switching, Non-stationary opponents[C]//2016 International Conference on Autonomous Agents & Multiagent Systems. Singapore, Singapore: International Foundation for Autonomous Agents and Multiagent Systems, 2016: 1315-1316.
68	Zheng Y, Meng Z, Hao J, et al. A Deep Bayesian Policy Reuse Approach against Non-stationary Agents[C]//32nd International Conference on Neural Information Processing Systems. Red Hook, New York: Curran Associates Inc, 2018: 962-972.
69	Yang T, Hao J, Meng Z, et al. Towards Efficient Detection and Optimal Response Against Sophisticated Opponents[C]//28th International Joint Conference on Artificial Intelligence. Macao, China: AAAI, 2019: 623-629.
70	Lample G, Chaplot D S. Playing FPS Games with Deep Reinforcement Learning[C]//Thirty-First AAAI Conference on Artificial Intelligence. San Francisco, California, USA: AAAI, 2017: 2140-2146.
71	Hong Z W, Su S Y, Shann T Y, et al. A Deep Policy Inference Q-Network for Multi-agent Systems[C]//17th International Conference on Autonomous Agents and Multi-Agent Systems. Stockholm, Sweden: International Foundation for Autonomous Agents and Multiagent Systems, 2018: 1388-1396.
72	Hernandez-Leal P, Kartal B, Taylor M E. Agent Modeling as Auxiliary Task for Deep Reinforcement Learning[C]//AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment. Palo Alto, California: AAAI, 2019, 15(1): 31-37.
73	Tacchetti A, Song H F, Mediano P A M, et al. Relational Forward Models for Multi-agent Learning[J/OL]. ArXiv Preprint ArXiv:1809.11044, 2018. [2022-03-25].
74	Grover A, Al-Shedivat M, Gupta J, et al. Learning Policy Representations in Multiagent Systems[C]//35th International Conference on Machine Learning. Stockholm, Sweden: PMLR, 2018: 1802-1811.
75	Ha D, Schmidhuber J. Recurrent World Models Facilitate Policy Evolution[C]//32nd International Conference on Neural Information Processing Systems. Red Hook, New York, USA: Curran Associates Inc, 2018: 2455-2467.
76	Tian Z, Zou S, Davies I, et al. Learning to Communicate Implicitly by Actions[C]//AAAI Conference on Artificial Intelligence. New York: AAAI, 2020, 34(5): 7261-7268.
77	Haarnoja T, Zhou A, Hartikainen K, et al. Soft Actor-critic Algorithms and Applications[J/OL]. ArXiv Preprint ArXiv:1812.05905, 2018. [2022-03-25].
78	Tian Z, Wen Y, Gong Z, et al. A Regularized Opponent Model with Maximum Entropy Objective[J/OL]. ArXiv Preprint ArXiv:1905.08087, 2019. [2022-03-25].
79	Raileanu R, Denton E, Szlam A, et al. Modeling Others Using Oneself in Multi-agent Reinforcement Learning[C]//International Conference on Machine Learning. Stockholm, Sweden: PMLR, 2018: 4257-4266.
80	Zhang C, Lesser V. Multi-agent Learning with Policy Prediction[C]//Twenty-Fourth AAAI Conference on Artificial Intelligence. Atlanta, USA: AAAI, 2010: 927-934.
81	Foerster J N, Chen R Y, Al-Shedivat M, et al. Learning with Opponent-learning Awareness[C]//17th International Conference on Autonomous Agents and MultiAgent Systems. Stockholm, Sweden: International Foundation for Autonomous Agents and Multiagent Systems, 2018: 122-130.
82	Moravčík M, Schmid M, Burch N, et al. DeepStack: Expert-Level Artificial Intelligence in Heads-up No-Limit Poker[J]. Science(S0036-8075), 2017, 356(6337): 508-513.
83	Brown N, Sandholm T. Superhuman AI for Heads-up No-limit Poker: Libratus Beats Top Professionals[J]. Science(S0036-8075), 2018, 359(6374): 418-424.
84	Tang Z, Zhu Y, Zhao D. Enhanced Rolling Horizon Evolution Algorithm with Opponent Model Learning[J]. IEEE Transactions on Games(S2475-1502), 2020. DOI:10.1109/TG.2020.3022698. [2022-03-25].
85	Huang S, Su H, Zhu J, et al. Combo-action: Training Agent for FPS Game with Auxiliary Tasks[C]//AAAI Conference on Artificial Intelligence. Hawaii, USA: AAAI, 2019, 33(1): 954-961.
86	Wang K, Chang K C, Chang Z W. Determinants of We-intention for Continue Playing FPS Game: Cooperation and Competition[C]//7th Multidisciplinary in International Social Networks Conference and the 3rd International Conference on Economics, Management and Technology. Kaohsiung, Taiwan, China: ACM, 2020: 1-9.
87	Yu X, Jiang J, Jiang H, et al. Model-based Opponent Modeling[J/OL]. ArXiv Preprint ArXiv:2108.01843, 2021. [2022-03-25].
88	Iglesias J A, Ledezma A, Sanchis A. Opponent Modeling in RoboCup Soccer Simulation[C]//Workshop of Physical Agents. Cham: Springer, 2018: 303-316.
89	Wu Z, Li K, Zhao E, et al. L2E: Learning to Exploit Your Opponent[J/OL]. ArXiv Preprint ArXiv:2102.09381, 2021. [2022-03-25].
90	Dzieńkowski B J, Strode C, Markowska-Kaczmar U. Employing Game Theory and Computational Intelligence to Find the Optimal Strategy of an Autonomous Underwater Vehicle Against A Submarine[C]//2016 Federated Conference on Computer Science and Information Systems(FedCSIS). Gdansk, Poland: IEEE, 2016: 31-40.
91	Huang C, Dong K, Huang H, et al. Autonomous Air Combat Maneuver Decision Using Bayesian Inference and Moving Horizon Optimization[J]. Journal of Systems Engineering and Electronics(S1004-4132), 2018, 29(1): 86-97.
92	Zhou K, Wei R, Zhang Q, et al. Learning System for Air Combat Decision Inspired by Cognitive Mechanisms of the Brain[J]. IEEE Access(S2169-3536), 2020, 8: 8129-8144.
93	Shi Wei, Feng Yanghe, Cheng Guangquan, et al. Research on Multi-aircraft Cooperative Air Combat Method Based on Deep Reinforcement Learning[J]. Acta Automatica Sinica, 2021, 47(7): 1610-1623. DOI:10.16383/j.aas.c201059. (in Chinese)
94	Defense Advanced Research Projects Agency. Constructive Machine-learning Battles with Adversary-Tactics[EB/OL]. (2021-07-21)[2022-03-28].
95	Parsons D, Surdu J, Jordan B. OneSAF: A Next Generation Simulation Modeling the Contemporary Operating Environment[C]//Euro-simulation Interoperability Workshop. Toulouse, France: Simulation Interoperability Standards Organization (SISO), 2005: 27-29.
96	Liu X, Zhao M, Dai S, et al. Tactical Intention Recognition in Wargame[C]//2021 IEEE 6th International Conference on Computer and Communication Systems(ICCCS). Chengdu, China: IEEE, 2021: 429-434.
97	Wang Zhen, Yuan Yong, An Bo, et al. An Overview of Security Games[J]. Journal of Command and Control, 2015, 1(2): 121-149. (in Chinese)
98	Zhang Y, An B, Tran-Thanh L, et al. Optimal Escape Interdiction on Transportation Networks[C]//26th International Joint Conference on Artificial Intelligence. Melbourne, Australia: AAAI, 2017: 3936-3944.
99	Xue W, Zhang Y, Li S, et al. Solving Large-scale Extensive-form Network Security Games via Neural Fictitious Self-play[J/OL]. ArXiv Preprint ArXiv:2106.00897, 2021. [2022-03-25].
100	Jain M, Korzhyk D, Vaněk O, et al. A Double Oracle Algorithm for Zero-sum Security Games on Graphs[C]//The 10th International Conference on Autonomous Agents and Multiagent Systems. Taipei, Taiwan, China: International Foundation for Autonomous Agents and Multiagent Systems, 2011: 327-334.
101	Zhang Y, Guo Q, An B, et al. Optimal Interdiction of Urban Criminals with the Aid of Real-time Information[C]//AAAI Conference on Artificial Intelligence. Hawaii, USA: AAAI, 2019, 33(1): 1262-1269.
102	Tian R, Li S, Li N, et al. Adaptive Game-theoretic Decision Making for Autonomous Vehicle Control at Roundabouts[C]//2018 IEEE Conference on Decision and Control(CDC). Miami Beach, USA: IEEE, 2018: 321-326.
103	Notomista G, Wang M, Schwager M, et al. Enhancing Game-theoretic Autonomous Car Racing Using Control Barrier Functions[C]//2020 IEEE International Conference on Robotics and Automation(ICRA). Paris, France: IEEE, 2020: 5393-5399.
104	Okamoto S, Hazon N, Sycara K. Solving Non-zero Sum Multiagent Network Flow Security Games with Attack Costs[C]//11th International Conference on Autonomous Agents and Multiagent Systems. Valencia, Spain: International Foundation for Autonomous Agents and Multiagent Systems, 2012(2): 879-888.
105	Brockman G, Cheung V, Pettersson L, et al. OpenAI Gym[J/OL]. ArXiv Preprint ArXiv:1606.01540, 2016. [2022-03-25].
106	Todorov E, Erez T, Tassa Y. MuJoCo: A Physics Engine for Model-based Control[C]//2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. Vilamoura, Portugal: IEEE, 2012: 5026-5033.
107	Lu F, Yamamoto K, Nomura L H, et al. Fighting Game Artificial Intelligence Competition Platform[C]//2013 IEEE 2nd Global Conference on Consumer Electronics (GCCE). Tokyo, Japan: IEEE, 2013: 320-323.
108	Littman M L. Markov Games as A Framework for Multi-agent Reinforcement Learning[M]// Machine Learning Proceedings. San Mateo, CA: Morgan Kaufmann, 1994: 157-163.
109	Wang X, Sandholm T. Reinforcement Learning to Play An Optimal Nash Equilibrium in Team Markov Games[C]//15th International Conference on Neural Information Processing Systems. Cambridge, USA: MIT Press, 2002: 1603-1610.
110	Monahan G E. State of the Art-A Survey of Partially Observable Markov Decision Processes: Theory, Models, and Algorithms[J]. Management Science(S0025-1909), 1982, 28(1): 1-16.
111	Kuhn H W, Tucker A W. Contributions to the Theory of Games[M]. Princeton, New Jersey: Princeton University Press, 1953.
112	Papoudakis G, Albrecht S V. Variational Autoencoders for Opponent Modeling in Multi-agent Systems[J/OL]. ArXiv Preprint ArXiv:2001.10829, 2020. [2022-03-25].
113	Papoudakis G, Christianos F, Albrecht S. Agent Modelling under Partial Observability for Deep Reinforcement Learning[C]//Advances in Neural Information Processing Systems. Virtual: Curran Associates Inc, 2021, 34: 19210-19222.
114	Davies I, Tian Z, Wang J. Learning to Model Opponent Learning[C]//AAAI Conference on Artificial Intelligence. New York, USA: AAAI, 2020, 34(10): 13771-13772.
115	Shen M, How J P. Robust Opponent Modeling via Adversarial Ensemble Reinforcement Learning[C]//International Conference on Automated Planning and Scheduling. Guangzhou, China: AAAI, 2021, 31: 578-587.
116	Wang T, Bao X, Clavera I, et al. Benchmarking Model-based Reinforcement Learning[J/OL]. ArXiv Preprint ArXiv:1907.02057, 2019. [2022-03-25].
117	Papoudakis G, Christianos F, Rahman A, et al. Dealing with Non-stationarity in Multi-agent Deep Reinforcement Learning[J/OL]. ArXiv Preprint ArXiv:1906.04737, 2019. [2022-03-25].
118	Cemgil T, Ghaisas S, Dvijotham K, et al. The Autoencoding Variational Autoencoder[C]//Advances in Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc, 2020, 33: 15077-15087.
119	Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction[M]. Berlin, Germany: Springer, 2001.
120	Genc S, Mallya S, Bodapati S, et al. Zero-shot Reinforcement Learning with Deep Attention Convolutional Neural Networks[J/OL]. ArXiv Preprint ArXiv:2001.00605, 2020. [2022-03-25].
121	Wang Y, Yao Q, Kwok J T, et al. Generalizing from a Few Examples: A Survey on Few-shot Learning[J]. ACM Computing Surveys(CSUR)(S0360-0300), 2020, 53(3): 1-34.
122	Finn C, Abbeel P, Levine S. Model-agnostic Meta-Learning for Fast Adaptation of Deep Networks[C]//International Conference on Machine Learning. Sydney, Australia: PMLR, 2017: 1126-1135.
123	Southey F, Bowling M, Larson B, et al. Bayes' Bluff: Opponent Modelling in Poker[C]//Twenty-first Conference on Uncertainty in Artificial Intelligence. Edinburgh, Scotland: AUAI, 2005: 550-558.
124	Peng P, Wen Y, Yang Y, et al. Multiagent Bidirectionally-Coordinated Nets: Emergence of Human-level Coordination in Learning to Play StarCraft Combat Games[J/OL]. ArXiv Preprint ArXiv:1703.10069, 2017. [2022-03-25].
125	Bontrager P, Khalifa A, Anderson D, et al. "Superstition" in the Network: Deep Reinforcement Learning Plays Deceptive Games[C]//AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment. Atlanta, USA: AAAI, 2019, 15(1): 10-16.
126	Johnson P E, Grazioli S, Jamal K. Fraud Detection: Intentionality and Deception in Cognition[J]. Accounting, Organizations and Society(S0361-3682), 1993, 18(5): 467-488.
127	Masters P, Kirley M, Smith W. Extended Goal Recognition: A Planning-based Model for Strategic Deception[C]//20th International Conference on Autonomous Agents and Multi-agent Systems. Virtual Event, United Kingdom: International Foundation for Autonomous Agents and Multiagent Systems, 2021: 871-879.
128	Wang Z, Boularias A, Mülling K, et al. Balancing Safety and Exploitability in Opponent Modeling[C]//AAAI Conference on Artificial Intelligence. San Francisco, USA: AAAI, 2011, 25(1): 1515-1520.
129	Ganzfried S, Sandholm T. Safe Opponent Exploitation[J]. ACM Transactions on Economics and Computation (TEAC)(S2167-8375), 2015, 3(2): 1-28.
130	Deisenroth M, Rasmussen C E. PILCO: A Model-based and Data-efficient Approach to Policy Search[C]//28th International Conference on Machine Learning (ICML-11). Kyoto, Japan: PMLR, 2011: 465-472.
131	Chua K, Calandra R, McAllister R, et al. Deep Reinforcement Learning in a Handful of Trials Using Probabilistic Dynamics Models[C]//32nd International Conference on Neural Information Processing Systems. Red Hook, New York, USA: Curran Associates Inc, 2018: 4759-4770.
132	Zhang W, Wang X, Shen J, et al. Model-based Multi-agent Policy Optimization with Adaptive Opponent-wise Rollouts[J/OL]. ArXiv Preprint ArXiv:2105.03363, 2021. [2022-03-25].
133	Pan X, Seita D, Gao Y, et al. Risk Averse Robust Adversarial Reinforcement Learning[C]//2019 International Conference on Robotics and Automation (ICRA). Montreal, Canada: IEEE, 2019: 8522-8528.
134	Vinitsky E, Du Y, Parvate K, et al. Robust Reinforcement Learning Using Adversarial Populations[J/OL]. ArXiv Preprint ArXiv:2008.01825, 2020. [2022-03-25].
135	Ramoni M, Sebastiani P. Robust Learning with Missing Data[J]. Machine Learning(S0885-6125), 2001, 45(2): 147-170.
136	Steinhardt J. Robust Learning: Information Theory and Algorithms[M]. Palo Alto, CA: Stanford University, 2018.
137	Pinto L, Davidson J, Sukthankar R, et al. Robust Adversarial Reinforcement Learning[C]//International Conference on Machine Learning. Sydney, Australia: PMLR, 2017: 2817-2826.
138	Shioya H, Iwasawa Y, Matsuo Y. Extending Robust Adversarial Reinforcement Learning Considering Adaptation and Diversity[J/OL]. ArXiv Preprint ArXiv:1703.02702, 2017. [2022-03-25].
139	Sagi O, Rokach L. Ensemble Learning: A Survey[J]. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery(S1942-4787), 2018, 8(4): e1249.
140	Hinton G, Vinyals O, Dean J. Distilling the Knowledge in a Neural Network[J/OL]. ArXiv Preprint ArXiv:1503.02531, 2015. [2022-03-25].
141	Wang J X, Kurth-Nelson Z, Kumaran D, et al. Prefrontal Cortex as a Meta-reinforcement Learning System[J]. Nature Neuroscience(S1097-6256), 2018, 21(6): 860-868.
142	Nagabandi A, Clavera I, Liu S, et al. Learning to Adapt in Dynamic, Real-world Environments through Meta-reinforcement Learning[J/OL]. ArXiv Preprint ArXiv:1803.11347, 2018. [2022-03-25].
143	Yu T, Quillen D, He Z, et al. Meta-world: A Benchmark and Evaluation for Multi-task and Meta Reinforcement Learning[C]//Conference on Robot Learning. Virtual Event/Cambridge, USA: PMLR, 2020: 1094-1100.
144	Gupta A, Mendonca R, Liu Y X, et al. Meta-reinforcement Learning of Structured Exploration Strategies[C]//32nd International Conference on Neural Information Processing Systems. Montréal, Canada: Curran Associates Inc, 2018: 5307-5316.
145	Kim D K, Liu M, Riemer M D, et al. A Policy Gradient Algorithm for Learning to Learn in Multiagent Reinforcement Learning[C]//International Conference on Machine Learning. Virtual Event: PMLR, 2021: 5541-5550.

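Category	Algorithm	Advantages	Shortcomings
Value-function approximation	DQN^[6]	Experience replay; off-policy mechanism	Not applicable to high-dimensional, continuous spaces
Value-function approximation	DRQN^[33]	Replaces the fully connected layer with an LSTM	Performs worse than DQN under full observability
Policy gradient	DDPG^[34]	Deterministic policy; Actor-Critic framework	Cannot handle discrete problems; the update step size is hard to determine
Policy gradient	A3C^[35]	Multi-threaded learning; asynchronous parameter updates	High variance in policy updates
Policy gradient	PPO^[37]	Clipping; KL divergence with an adaptive hyperparameter	Sensitive to samples that differ widely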
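The clipping mentioned in the PPO row can be illustrated for a single sample. The following is a minimal sketch of the clipped surrogate term from the PPO paper^[37], evaluated for one probability ratio and one advantage estimate. The function name and the default `clip_eps=0.2` are illustrative choices, and the adaptive-KL variant from the same row is not shown.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO clipped surrogate term for one sample.

    ratio: new-policy probability divided by old-policy probability.
    advantage: advantage estimate for the sampled action.
    """
    # Restrict the ratio to the interval [1 - clip_eps, 1 + clip_eps].
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Take the more pessimistic of the unclipped and clipped objectives.
    return min(ratio * advantage, clipped_ratio * advantage)


# A large ratio with a positive advantage is capped at (1 + clip_eps) * advantage.
capped = clipped_surrogate(ratio=1.5, advantage=1.0)
# A small ratio with a negative advantage is held at (1 - clip_eps) * advantage.
floored = clipped_surrogate(ratio=0.5, advantage=-1.0)
# A ratio inside the interval is left unchanged.
unchanged = clipped_surrogate(ratio=1.1, advantage=2.0)
```

Taking the minimum removes the incentive to move the new policy far from the old one within a single update.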

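Category	Algorithm	Research motivation	Model purpose	Innovation	Limitations
Theory of mind	ToMnet^[60]	Proposes, based on theory of mind, a meta-learning opponent model consistent with human cognition	Predicts opponent behavior, goals, and beliefs	Builds a meta-learned prior model used to predict representations and mental states	Experimental scenarios are simple, and the environment is fully observable
Cognitive hierarchy	PR2^[63]	Gives agents recursive belief reasoning for inferring opponent policies	Infers the opponent's next intention	Proposes a decentralized framework for multi-agent probabilistic recursive reasoning, inferring opponent policies with variational Bayes	Converges in two-player games; underperforms in complex cooperative scenarios
Cognitive hierarchy	GR2^[64]	Models the bounded rationality of opponents through recursive reasoning over different hierarchy levels	Infers the opponent's next intention with level-K reasoning depth	Designs a hierarchy based on probabilistic graphical models and proves that a perfect Bayesian equilibrium exists	The recursion level must be chosen, which raises computational demands
Bayesian policy reuse	DPN-BPR+^[68]	Proposes policy detection and reuse mechanisms for non-stationary opponent policies	Updates the belief over the current opponent policy according to payoffs	Uses deep neural networks as value-function approximators for BPR+ and stores best-response policies with network distillation	Assumes the opponent switches among fixed policies; cannot identify continuously evolving opponent policies
Bayesian policy reuse	Deep Bayes-ToMoP^[69]	Combines the predictive ability of BPR with the recursive reasoning of theory of mind so that they complement each other	Multi-level recursive reasoning on top of BPR beliefs	Can learn how opponents evolve and can respond to unknown opponent policies	Learning new policies online is slow; cannot handle multiple opponents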
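The Bayesian policy reuse rows above rest on a belief over candidate opponent policies that is updated from observed payoffs. The following is a minimal sketch of one such Bayes-rule update, assuming a small discrete set of opponent types and a given likelihood of the observed payoff under each type. The type names, the likelihood values, and the function name `update_belief` are hypothetical.

```python
def update_belief(belief, likelihoods):
    """One Bayes-rule update of the belief over opponent policies.

    belief: dict mapping opponent type to prior probability.
    likelihoods: dict mapping opponent type to the probability of the
        observed payoff signal if the opponent were of that type.
    """
    unnormalized = {k: belief[k] * likelihoods[k] for k in belief}
    total = sum(unnormalized.values())
    if total == 0.0:
        # The observation is impossible under every type: keep the prior.
        return dict(belief)
    return {k: v / total for k, v in unnormalized.items()}


# Start from a uniform prior over two hypothetical opponent types.
prior = {"aggressive": 0.5, "defensive": 0.5}
# Hypothetical likelihood of the observed payoff under each type.
payoff_likelihood = {"aggressive": 0.8, "defensive": 0.2}

posterior_once = update_belief(prior, payoff_likelihood)
posterior_twice = update_belief(posterior_once, payoff_likelihood)
```

After the belief concentrates on one type, a BPR-style agent would reuse the stored response policy for that type. Detecting a switch or an unseen opponent requires additional machinery that this sketch omits.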

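Environment	Game model	References	Observability	Relationship	Action order	State/action space
Particle world	POMDP	[56-57,112-114]	Partially observable	Mixed	Simultaneous	Continuous
Texas Hold'em	EG	[38,40-42]	Fully observable	Competitive	Sequential	Discrete
Prisoner's dilemma/coin game	MG	[80]	Fully observable	Competitive	Simultaneous	Discrete
Multi-agent MuJoCo	POMDP	[115]	Partially observable	Mixed	Simultaneous	Continuous
Grid world	MG	[66-69]	Fully observable	Mixed	Simultaneous	Continuous
Iterated matrix games	Team MG	[64,78]	Fully observable	Competitive	Simultaneous	Discrete
Quiz bowl	EG	[26,71]	Fully observable	Competitive	Sequential	Discrete
Bomberman	MG	[72]	Fully observable	Competitive	Simultaneous	Discrete
Cooperative navigation	Dec-POMDP	[63-64,116]	Partially observable	Cooperative	Simultaneous	Discrete
FightingICE	MG	[84,107]	Fully observable	Competitive	Simultaneous	Continuous
Google football environment	POMDP	[87]	Partially observable	Mixed	Simultaneous	Continuous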
