Journal of System Simulation ›› 2023, Vol. 35 ›› Issue (4): 786-796. DOI: 10.16182/j.issn1004731x.joss.21-1321



Multi-agent Cooperative Combat Simulation in Naval Battlefield with Reinforcement Learning

Ding Shi, Xuefeng Yan, Lina Gong, Jingxuan Zhang, Donghai Guan, Mingqiang Wei

  1. School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211100, China
  • Received: 2021-12-20  Revised: 2022-03-01  Online: 2023-04-29  Published: 2023-04-12
  • Contact: Mingqiang Wei  E-mail: shiding0614@163.com; mqwei@nuaa.edu.cn
  • About the first author: Ding Shi (1996-), male, master's student; his research interests include multi-agent reinforcement learning. E-mail: shiding0614@163.com
  • Supported by: General Program of the National Natural Science Foundation of China (62172218)


Abstract:

Because the situation of future naval battlefields changes rapidly, it is urgent to realize high-quality combat simulation of the naval battlefield environment with artificial intelligence, so as to comprehensively optimize and improve the combat effectiveness of our army and defeat the enemy. The cooperation of combat units is the key link in naval battlefield combat simulation, and how to achieve balanced decision-making among multiple agents is the first problem to be solved. Based on a decoupled prioritized experience replay mechanism and an attention mechanism, a multi-agent reinforcement learning-based cooperative combat simulation (MARL-CCSA) algorithm is proposed. On top of MARL-CCSA, a multi-scale reward function is designed with expert experience, and a naval battlefield combat simulation environment is built on this function, which makes the training of MARL-CCSA easy to converge. Scenarios are designed for simulation experiments, and the results are compared with those of other algorithms to verify the feasibility and practicability of MARL-CCSA.
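To make the multi-scale reward idea concrete, the following is a minimal, hypothetical Python sketch (not the reward actually used in the paper): a dense step-level term is combined with a sparse episode-level term, and the field names and weights (StepInfo, damage_dealt, w_local, etc.) are illustrative assumptions that would in practice be set from expert experience.

```python
# Illustrative sketch of a multi-scale reward: a fine-grained per-step (local)
# term plus a coarse episode-level (global) term. All names and weights are
# assumptions for illustration only.
from dataclasses import dataclass


@dataclass
class StepInfo:
    damage_dealt: float        # damage our units inflicted this step
    damage_taken: float        # damage our units received this step
    distance_to_target: float  # mean distance of our units to the objective


def local_reward(step: StepInfo) -> float:
    """Dense per-step shaping signal (illustrative weights)."""
    return 0.1 * step.damage_dealt - 0.1 * step.damage_taken - 0.01 * step.distance_to_target


def global_reward(mission_success: bool, surviving_units: int, total_units: int) -> float:
    """Sparse episode-level signal awarded when the scenario terminates."""
    survival_ratio = surviving_units / max(total_units, 1)
    return (10.0 if mission_success else -10.0) + 5.0 * survival_ratio


def multi_scale_reward(step: StepInfo, done: bool, mission_success: bool,
                       surviving_units: int, total_units: int,
                       w_local: float = 1.0, w_global: float = 1.0) -> float:
    """Combine the two scales; the global term is only added at episode end."""
    r = w_local * local_reward(step)
    if done:
        r += w_global * global_reward(mission_success, surviving_units, total_units)
    return r
```

The design intuition, following the abstract, is that the dense local term keeps the training signal rich enough for convergence, while the sparse global term ties each agent's behaviour to the overall mission outcome.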

Key words: combat simulation, collaboration, reinforcement learning, prioritized experience replay, attention mechanism, multi-scale reward function

CLC Number: