Multi-agent Reinforcement Learning Method for Wargame Simulation Based on Suboptimal Demonstration Guidance

doi:10.16182/j.issn1004731x.joss.25-0743

Abstract

Traditional wargame agent decision-making models exhibit fixed behavior patterns and adapt poorly to complex adversarial environments. To address these issues, this paper proposes MARLSD, a multi-agent reinforcement learning method based on suboptimal demonstrations. The method combines reward relabeling with a self-imitation learning mechanism. Even when only a small number of suboptimal demonstrations are available, it improves the training efficiency of multi-agent reinforcement learning algorithms in environments with large state-action spaces and sparse rewards, while encouraging agents to explore better strategies. Experimental results show that, compared with baselines such as QMIX and MAGAIL, MARLSD significantly improves performance and training efficiency, adapts to various value-decomposition multi-agent reinforcement learning algorithms, and achieves strong results using only a small number of suboptimal demonstration trajectories.

Key words: suboptimal demonstration, sparse reward, self-imitation learning, wargame simulation, multi-agent reinforcement learning

CLC Number: TP391.9

Zhou Zicong, Zeng Junjie, Hu Yue, Zhu Zhengqiu, Yin Quanjun. Multi-agent Reinforcement Learning Method for Wargame Simulation Based on Suboptimal Demonstration Guidance[J]. Journal of System Simulation, 2026, 38(5): 1277-1289.

Figures/Tables 15

Fig. 1 to Fig. 5; Table 1 to Table 3

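Table 4 Experimental hyperparameter settings

Parameter	Value
Learning rate	0.001
Discount factor	0.95
Replay buffer capacity	1 000 000
Batch size	512
Initial exploration rate	1
Initial value of $\lambda$	1
Initial number of suboptimal demonstration trajectories	20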
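
As a minimal sketch, the settings in Table 4 could be collected in a single immutable configuration object. The class name `TrainConfig` and all field names are hypothetical and chosen only for illustration. The value 512 is read here as the mini-batch size, and the role of $\lambda$ is not described in this extract, so it is stored without interpretation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the Table 4 values (field names are assumptions)."""

    learning_rate: float = 0.001
    discount: float = 0.95            # discount factor
    replay_capacity: int = 1_000_000  # replay buffer capacity
    batch_size: int = 512             # assumed to be the mini-batch size
    epsilon_init: float = 1.0         # initial exploration rate
    lambda_init: float = 1.0          # initial value of lambda (role not given here)
    num_initial_demos: int = 20       # initial suboptimal demonstration trajectories


config = TrainConfig()
print(config)
```

Freezing the dataclass prevents accidental changes to a setting in the middle of a training run. Schedules, such as how the exploration rate decays from its initial value, are not given in the table and are deliberately left out.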

Fig. 6 to Fig. 11


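Unit type	Red count	Blue count	Weapon configuration	Fire-on-the-move capability
Heavy tank	2	2	Medium direct-fire gun, vehicle-mounted light weapon, gun-launched missile	Yes
Heavy combat vehicle	0	2	Small direct-fire gun, vehicle-mounted missile, vehicle-mounted light weapon	No
Medium combat vehicle	1	0	Heavy missile, rapid-fire gun, vehicle-mounted light weapon	No
Unmanned combat vehicle	1	0	Rapid-fire gun, vehicle-mounted light weapon, medium missile	No
Infantry squad	2	2	Portable missile, personnel light weapon, rocket launcher	No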

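Action	Description	Action index
No action	Take no action	0
Move	Move to an adjacent hex	1-6
Shoot	Attack an enemy target	7
Conceal	Toggle the concealment state	8
Seize control	Capture a position	9
Dismount	Infantry squad leaves its carrier	10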
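
The discrete action space in the table above has 11 indices (0 to 10), with indices 1 to 6 all denoting a move to one of the six neighbouring hexes. A minimal sketch of decoding an index into an action kind and a direction slot follows. The names `ActionKind` and `decode_action` are hypothetical, and the table does not say which index corresponds to which hex direction, so the sketch returns only a slot number from 0 to 5.

```python
from enum import Enum
from typing import Optional, Tuple


class ActionKind(Enum):
    NOOP = "noop"
    MOVE = "move"
    SHOOT = "shoot"
    CONCEAL = "conceal"
    SEIZE = "seize"
    DISMOUNT = "dismount"


NUM_ACTIONS = 11  # indices 0..10, as listed in the action table

# Indices that map to exactly one action kind.
_SINGLE = {
    7: ActionKind.SHOOT,
    8: ActionKind.CONCEAL,
    9: ActionKind.SEIZE,
    10: ActionKind.DISMOUNT,
}


def decode_action(index: int) -> Tuple[ActionKind, Optional[int]]:
    """Return (kind, direction_slot); direction_slot is None unless kind is MOVE."""
    if not 0 <= index < NUM_ACTIONS:
        raise ValueError(f"action index out of range: {index}")
    if index == 0:
        return ActionKind.NOOP, None
    if index <= 6:
        # Slot 0..5; the mapping from slot to hex direction is not specified.
        return ActionKind.MOVE, index - 1
    return _SINGLE[index], None


print(decode_action(3))
print(decode_action(9))
```

Whether every action is available to every unit (for example, dismounting applies to infantry squads) would be handled by an action mask in practice. The table does not describe such a mask, so none is sketched here.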

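Task	Task reward
Destroy a heavy tank	40
Destroy a medium combat vehicle	24
Destroy an unmanned combat vehicle	24
Eliminate an infantry squad	16
Occupy the primary position	80
Occupy the secondary position	50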
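
The task rewards above are event-based: a reward is granted only when a unit is destroyed or a position is occupied, which fits the sparse-reward setting named in the abstract. A minimal sketch of a lookup table and a summation over completed tasks follows. The event keys and the function `total_reward` are hypothetical, and summing over a list of events is an assumption about how rewards are aggregated, not something stated in the table.

```python
from typing import Iterable

# Reward values copied from the task reward table; the keys are illustrative names.
TASK_REWARDS = {
    "destroy_heavy_tank": 40,
    "destroy_medium_combat_vehicle": 24,
    "destroy_unmanned_combat_vehicle": 24,
    "eliminate_infantry_squad": 16,
    "occupy_primary_position": 80,
    "occupy_secondary_position": 50,
}


def total_reward(events: Iterable[str]) -> int:
    """Sum the table rewards for a sequence of completed task events."""
    return sum(TASK_REWARDS[event] for event in events)


# One heavy tank destroyed (40) and the primary position occupied (80).
print(total_reward(["destroy_heavy_tank", "occupy_primary_position"]))  # 120
```

An unknown event name raises `KeyError`, which surfaces mismatches between an environment's event log and the table rather than silently scoring them as zero.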