Multi-agent Reinforcement Learning Method for Wargame Simulation Based on Suboptimal Demonstration Guidance

doi:10.16182/j.issn1004731x.joss.25-0743

Abstract:

To address issues such as fixed behavior patterns and insufficient adaptability in complex adversarial environments exhibited by traditional wargame agent decision-making models, this paper proposes a multi-agent reinforcement learning method based on suboptimal demonstrations (MARLSD). The proposed method integrates reward relabeling with a self-imitation learning mechanism, effectively improving the training efficiency of multi-agent reinforcement learning algorithms in environments with large state-action spaces and sparse rewards, even when only a small number of suboptimal demonstrations are available, while encouraging agents to explore better strategies. Experimental results show that, compared with baselines such as QMIX and MAGAIL, MARLSD significantly improves performance and training efficiency, adapts to various value-decomposition multi-agent reinforcement learning algorithms, and achieves strong results using only a small number of suboptimal demonstration trajectories.

Key words: suboptimal demonstration, sparse reward, self-imitation learning, wargame simulation, multi-agent reinforcement learning

CLC number: TP391.9

Zhou Zicong, Zeng Junjie, Hu Yue, et al. Multi-agent Reinforcement Learning Method for Wargame Simulation Based on Suboptimal Demonstration Guidance[J]. Journal of System Simulation, 2026, 38(5): 1277-1289.

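Table 4 Experimental hyperparameter settings

Parameter	Value
Learning rate	0.001
Discount factor	0.95
Replay buffer capacity	1 000 000
Batch size	512
Initial exploration rate	1
Initial value of $λ$	1
Initial number of suboptimal demonstration trajectories	20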
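
The settings in the table above can be collected into a single configuration object. The following is a minimal sketch, not the authors' code: the class name, field names, and the helper `discounted_value` are hypothetical, and the role of $λ$ is not stated in this excerpt, so it is kept as an opaque named field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyperparameters:
    """Values copied from the hyperparameter table; field names are illustrative."""
    learning_rate: float = 0.001
    discount: float = 0.95
    replay_capacity: int = 1_000_000
    batch_size: int = 512
    initial_epsilon: float = 1.0      # initial exploration rate
    initial_lambda: float = 1.0       # role of lambda is not stated in this excerpt
    initial_demo_trajectories: int = 20


def discounted_value(reward: float, delay_steps: int, discount: float) -> float:
    """Present value of a single reward received after `delay_steps` steps."""
    return reward * discount ** delay_steps


HP = Hyperparameters()

# Assumed scenario: a task reward of 80 that only arrives 20 steps in the future.
value_20 = discounted_value(80, 20, HP.discount)
```

With a discount factor of 0.95, a reward of 80 that lies 20 steps ahead is worth a little under 29 at the current step, which illustrates why delayed, sparse rewards provide a weak learning signal early in training.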


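Operator type	Red count	Blue count	Weapons	Can fire while moving
Heavy tank	2	2	Medium direct-fire gun, vehicle-mounted light weapons, gun-launched missile	Yes
Heavy combat vehicle	0	2	Small direct-fire gun, vehicle-mounted missile, vehicle-mounted light weapons	No
Medium combat vehicle	1	0	Heavy missile, rapid-fire gun, vehicle-mounted light weapons	No
Unmanned combat vehicle	1	0	Rapid-fire gun, vehicle-mounted light weapons, medium missile	No
Infantry squad	2	2	Portable missile, personnel light weapons, rocket launcher	No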

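Action	Description	Action index
No action	Take no action	0
Move	Move to one of the six adjacent hexes	1-6
Shoot	Engage an enemy target	7
Conceal	Toggle the concealment state	8
Seize control	Capture a position	9
Dismount	Infantry squad leaves its vehicle	10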
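
The action table above defines a discrete action space with 11 indices (0 to 10). As an illustration only, the index-to-action mapping could be decoded as follows; the names `ActionKind`, `N_ACTIONS`, and `decode_action` are hypothetical, and the numbering of the six movement directions as 0 to 5 is an assumption, since the source does not say which index corresponds to which neighboring hex.

```python
from enum import Enum
from typing import Optional, Tuple


class ActionKind(Enum):
    NOOP = "no action"
    MOVE = "move"
    SHOOT = "shoot"
    CONCEAL = "conceal"
    SEIZE = "seize control"
    DISMOUNT = "dismount"


N_ACTIONS = 11  # indices 0..10 from the action table

# Single-index actions taken directly from the table.
_FIXED = {
    0: ActionKind.NOOP,
    7: ActionKind.SHOOT,
    8: ActionKind.CONCEAL,
    9: ActionKind.SEIZE,
    10: ActionKind.DISMOUNT,
}


def decode_action(index: int) -> Tuple[ActionKind, Optional[int]]:
    """Map a flat action index to (kind, hex direction or None)."""
    if not 0 <= index < N_ACTIONS:
        raise ValueError(f"action index out of range: {index}")
    if 1 <= index <= 6:
        # Assumed convention: direction 0..5 for the six adjacent hexes.
        return ActionKind.MOVE, index - 1
    return _FIXED[index], None
```

For example, `decode_action(3)` yields a move in direction 2 under the assumed numbering, while `decode_action(7)` yields the shoot action with no direction.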

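Task	Reward
Destroy a heavy tank	40
Destroy a medium combat vehicle	24
Destroy an unmanned combat vehicle	24
Eliminate an infantry squad	16
Capture the primary position	80
Capture the secondary position	50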
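
The task rewards above amount to a lookup table over completed tasks. A minimal sketch of how an episode's total task reward might be tallied is shown below; the dictionary keys and the function `episode_reward` are hypothetical names, and simple summation over completed tasks is an assumption rather than something the table states.

```python
# Reward values copied from the task reward table; key names are illustrative.
TASK_REWARDS = {
    "destroy_heavy_tank": 40,
    "destroy_medium_combat_vehicle": 24,
    "destroy_unmanned_combat_vehicle": 24,
    "eliminate_infantry_squad": 16,
    "capture_primary_position": 80,
    "capture_secondary_position": 50,
}


def episode_reward(completed_tasks):
    """Sum the rewards of the tasks completed in one episode (assumed additive)."""
    return sum(TASK_REWARDS[task] for task in completed_tasks)


# Example: two heavy tanks destroyed, then the primary position captured.
example_total = episode_reward(
    ["destroy_heavy_tank", "destroy_heavy_tank", "capture_primary_position"]
)
print(example_total)  # 160
```

Because every entry is tied to a completed task, an agent receives nothing until a kill or capture occurs, which matches the sparse-reward setting named in the abstract.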