Intelligent Air Combat Decision-making Method Based on BiGRU and Priority Dynamic Sampling

doi:10.16182/j.issn1004731x.joss.25-0472

Abstract

Current multi-agent reinforcement learning algorithms use experience data inefficiently, and their learning rates are difficult to set. To address these issues, this paper proposes a BiGRU-based multi-agent proximal policy optimization (PPO) algorithm with priority sampling and a dynamic learning rate. A BiGRU network is introduced to strengthen the policy network's ability to model temporal information. A priority partial sampling mechanism is introduced to improve the utilization of high-value experience data. An improved Adam optimizer adjusts the learning rate dynamically, which addresses the difficulty of setting the learning rate. Simulation results show that the algorithm improves convergence speed, stability, and combat win rate, offering a new optimization scheme for multi-agent air combat decision-making.

Key words: air combat policy optimization, priority partial sampling, dynamic learning rate, deep reinforcement learning
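
As a rough illustration of what priority-based sampling of experience can look like, the sketch below draws buffer indices with probability proportional to a priority raised to a power `alpha`, in the style of proportional prioritized experience replay. It is not the paper's priority partial sampling mechanism, whose details are not given in this abstract; the function names, the `alpha` default, and the example priorities are all assumptions made for illustration.

```python
import random
from typing import List, Optional, Sequence


def sampling_probabilities(priorities: Sequence[float], alpha: float = 0.6) -> List[float]:
    """Turn positive priorities into probabilities P(i) = p_i^alpha / sum_k p_k^alpha."""
    if not priorities or any(p <= 0 for p in priorities):
        raise ValueError("priorities must be a non-empty sequence of positive numbers")
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_indices(
    priorities: Sequence[float],
    batch_size: int,
    alpha: float = 0.6,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Draw batch_size buffer indices (with replacement), favoring high priorities."""
    rng = rng or random.Random()
    probs = sampling_probabilities(priorities, alpha)
    return rng.choices(range(len(priorities)), weights=probs, k=batch_size)


# Hypothetical priorities for a four-item buffer (for example, absolute TD errors).
example_priorities = [0.1, 0.5, 2.0, 0.2]
example_batch = sample_indices(example_priorities, batch_size=8, rng=random.Random(0))
```

With `alpha = 0` every item is equally likely, so the sketch reduces to uniform sampling; larger `alpha` values concentrate the batch on high-priority items.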

CLC number: TP391.9

Ding Zhengkun, Liu Jiaqi, Xu Junzheng, et al. Intelligent Air Combat Decision-making Method Based on BiGRU and Priority Dynamic Sampling[J]. Journal of System Simulation, 2026, 38(2): 447-459.

Figures and tables: 13

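Table 1

Basic UAV actions and corresponding control values

Action No.	Basic action	$n_x$	$n_z$	$\gamma$
$a_1$	Constant-speed straight flight	0	1	0
$a_2$	Accelerating straight flight	2	1	0
$a_3$	Decelerating straight flight	1	0	0
$a_4$	Constant-speed left turn	0	8	$-\arccos(1/8)$
$a_5$	Accelerating left turn	2	8	$-\arccos(1/8)$
$a_6$	Decelerating left turn	1	8	$-\arccos(1/8)$
$a_7$	Constant-speed right turn	0	8	$\arccos(1/8)$
$a_8$	Accelerating right turn	2	8	$\arccos(1/8)$
$a_9$	Decelerating right turn	1	8	$\arccos(1/8)$
$a_{10}$	Constant-speed climb	0	8	0
$a_{11}$	Accelerating climb	2	8	0
$a_{12}$	Decelerating climb	1	8	0
$a_{13}$	Constant-speed dive	0	8	$\pi$
$a_{14}$	Accelerating dive	2	8	$\pi$
$a_{15}$	Decelerating dive	1	8	$\pi$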
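
One way to use a discrete action table like this in code is a lookup from action index to a control triple. The sketch below is a minimal example of that idea, with the values copied as they appear in the table above; the `Control` type, the `ACTIONS` dictionary, and the `control_for` helper are hypothetical names introduced here, not part of the paper.

```python
import math
from typing import Dict, NamedTuple, Tuple


class Control(NamedTuple):
    n_x: float    # first control value in the table
    n_z: float    # second control value in the table
    gamma: float  # third control value, in radians


# arccos(1/8) is about 1.4455 rad (roughly 82.8 degrees).
TURN = math.acos(1 / 8)

# Action index -> (English label, control values), copied row by row from the table.
ACTIONS: Dict[int, Tuple[str, Control]] = {
    1: ("constant-speed straight flight", Control(0, 1, 0.0)),
    2: ("accelerating straight flight", Control(2, 1, 0.0)),
    3: ("decelerating straight flight", Control(1, 0, 0.0)),
    4: ("constant-speed left turn", Control(0, 8, -TURN)),
    5: ("accelerating left turn", Control(2, 8, -TURN)),
    6: ("decelerating left turn", Control(1, 8, -TURN)),
    7: ("constant-speed right turn", Control(0, 8, TURN)),
    8: ("accelerating right turn", Control(2, 8, TURN)),
    9: ("decelerating right turn", Control(1, 8, TURN)),
    10: ("constant-speed climb", Control(0, 8, 0.0)),
    11: ("accelerating climb", Control(2, 8, 0.0)),
    12: ("decelerating climb", Control(1, 8, 0.0)),
    13: ("constant-speed dive", Control(0, 8, math.pi)),
    14: ("accelerating dive", Control(2, 8, math.pi)),
    15: ("decelerating dive", Control(1, 8, math.pi)),
}


def control_for(action_index: int) -> Control:
    """Return the control values for a 1-based action index (1 to 15)."""
    return ACTIONS[action_index][1]
```

A policy that outputs a discrete action index in the range 1 to 15 could then call `control_for(index)` to obtain the values to pass to a flight model.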

Figure 1

Figure 2

Figure 3

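Table 2

Air combat simulation parameters

Parameter	Value
Neural network learning rate $l$	0.0005
Discount factor $\gamma$	0.99
Maximum capacity of experience buffer	2000
Network training steps	20000
Time steps per episode	200
Simulation time step/s	0.2
Maximum attack distance $d_{\max}$/m	2000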
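
The parameters above could be collected into a single configuration object so that training code reads them from one place. The following is a minimal sketch under that assumption; the class name `SimConfig`, its field names, and the derived `episode_duration_s` property are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimConfig:
    """Simulation parameters, with defaults copied from the table above."""

    learning_rate: float = 0.0005          # neural network learning rate l
    discount_factor: float = 0.99          # discount factor gamma
    buffer_capacity: int = 2000            # maximum capacity of experience buffer
    training_steps: int = 20000            # network training steps
    steps_per_episode: int = 200           # time steps per episode
    time_step_s: float = 0.2               # simulation time step, seconds
    max_attack_distance_m: float = 2000.0  # maximum attack distance d_max, meters

    @property
    def episode_duration_s(self) -> float:
        """Simulated time covered by one full-length episode, in seconds."""
        return self.steps_per_episode * self.time_step_s


config = SimConfig()
```

With these defaults, one full-length episode covers 200 steps of 0.2 s each, about 40 s of simulated time.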

Figure 4

Figure 5

Figure 6

Figure 7

Figure 8

Table 3

Figure 9

Figure 10

Algorithm	GRU network	Learning rate adjustment	Priority sampling
MAPPO	×	×	×
MAPPO-GRU	Unidirectional	×	×
MAPPO-BiGRU	Bidirectional	×	×
MAPPO-BiGRU-PS	Bidirectional	×	√
MAPPO-BiGRU-LR	Bidirectional	√	×
MAPPO-BiGRU-PS-LR	Bidirectional	√	√
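
The feature combinations in the table above can be written down as a small set of flags, which makes it easy to loop over the compared algorithms in an experiment script. The sketch below is one possible encoding; the `Variant` type, the `VARIANTS` list, and the `build_name` helper are hypothetical names, and the naming rule is inferred only from the algorithm names listed in the table.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Variant:
    name: str
    gru: Optional[str]       # None, "unidirectional", or "bidirectional"
    dynamic_lr: bool         # learning rate adjustment column
    priority_sampling: bool  # priority sampling column


# One entry per row of the table above.
VARIANTS: List[Variant] = [
    Variant("MAPPO", None, False, False),
    Variant("MAPPO-GRU", "unidirectional", False, False),
    Variant("MAPPO-BiGRU", "bidirectional", False, False),
    Variant("MAPPO-BiGRU-PS", "bidirectional", False, True),
    Variant("MAPPO-BiGRU-LR", "bidirectional", True, False),
    Variant("MAPPO-BiGRU-PS-LR", "bidirectional", True, True),
]


def build_name(gru: Optional[str], dynamic_lr: bool, priority_sampling: bool) -> str:
    """Rebuild an algorithm name from its flags, following the table's naming pattern."""
    parts = ["MAPPO"]
    if gru == "unidirectional":
        parts.append("GRU")
    elif gru == "bidirectional":
        parts.append("BiGRU")
    if priority_sampling:
        parts.append("PS")
    if dynamic_lr:
        parts.append("LR")
    return "-".join(parts)
```

Read this way, each row after the first changes one or two components relative to another row, which is the usual layout of an ablation comparison.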