Journal of System Simulation ›› 2026, Vol. 38 ›› Issue (5): 1277-1289. doi: 10.16182/j.issn1004731x.joss.25-0743

Multi-agent Reinforcement Learning Method for Wargame Simulation Based on Suboptimal Demonstration Guidance

Zhou Zicong, Zeng Junjie, Hu Yue, Zhu Zhengqiu, Yin Quanjun

  1. College of Systems Engineering, National University of Defense Technology, Changsha, Hunan 410073, China
  • Received: 2025-08-03 Revised: 2025-12-04 Online: 2026-05-21 Published: 2026-05-29
  • Contact: Zeng Junjie
  • About the first author: Zhou Zicong (b. 2001), male, master's student; research interests include system simulation and multi-agent systems.
  • Supported by:
    National Natural Science Foundation of China (62306329)

Abstract:

Traditional decision-making models for wargame agents tend to show rigid behavior patterns and insufficient adaptability in complex adversarial environments. To address these problems, this paper proposes multi-agent reinforcement learning from suboptimal demonstrations (MARLSD). The method combines reward relabeling with a self-imitation learning mechanism. Using only a small number of suboptimal demonstrations, it improves the training efficiency of multi-agent reinforcement learning algorithms in environments with large state-action spaces and sparse rewards, while encouraging agents to keep exploring for better policies. Experimental results show that, compared with baselines such as QMIX and MAGAIL, MARLSD achieves better performance and noticeably higher training efficiency, can be applied to a variety of value-decomposition multi-agent reinforcement learning algorithms, and reaches good results with only a few suboptimal demonstration trajectories.

Key words: suboptimal demonstration, sparse reward, self-imitation learning, wargame simulation, multi-agent reinforcement learning
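
For readers unfamiliar with the two ingredients named in the abstract, the sketch below illustrates them in their generic, single-agent textbook forms. It is not the MARLSD algorithm, whose details the abstract does not give. It assumes an SQIL-style relabeling (demonstration transitions receive a constant reward, all other transitions receive zero) and the clipped term max(R - Q, 0) commonly used in self-imitation learning. The names `Transition`, `relabel_rewards`, `discounted_returns`, `self_imitation_loss` and the constant `demo_reward` are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Transition:
    state: int
    action: int
    reward: float
    is_demo: bool  # True if the transition comes from a demonstration


def relabel_rewards(batch: List[Transition], demo_reward: float = 1.0) -> List[Transition]:
    """Assumed SQIL-style relabeling: demo transitions get demo_reward, others get 0."""
    return [replace(t, reward=demo_reward if t.is_demo else 0.0) for t in batch]


def discounted_returns(rewards: List[float], gamma: float = 0.99) -> List[float]:
    """Discounted return-to-go for each step of a single episode."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def self_imitation_loss(returns: List[float], q_values: List[float]) -> float:
    """Mean of 0.5 * max(R - Q, 0)^2 over the batch.

    Only steps whose observed return exceeds the current value estimate
    contribute, so the learner imitates its own better-than-expected past.
    """
    clipped = [max(r - q, 0.0) for r, q in zip(returns, q_values)]
    return sum(0.5 * c * c for c in clipped) / len(clipped)
```

In a value-decomposition setting such as QMIX, `q_values` would presumably be the mixed joint value rather than a per-agent value; that mapping is an assumption of this sketch, not something stated in the abstract.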

CLC number: