Journal of System Simulation ›› 2024, Vol. 36 ›› Issue (6): 1452-1467.doi: 10.16182/j.issn1004731x.joss.23-0349

• Papers •

Curriculum Learning-based Simulation of UAV Air Combat Under Sparse Rewards

Zhu Jingyu1, Zhang Hongli1, Kuang Minchi2, Shi Heng2, Zhu Jihong2, Qiao Zhi2, Zhou Wenqing3

  1. School of Electrical Engineering, Xinjiang University, Urumqi 830000, China
    2.Department of Precision Instrument, Tsinghua University, Beijing 100084, China
    3.Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
  • Received:2023-03-29 Revised:2023-05-18 Online:2024-06-28 Published:2024-06-19
  • Contact: Zhang Hongli E-mail:zhujingyu@stu.xju.edu.cn;zhl@xju.edu.cn

Abstract:

To address the limited exploration capability and sparse rewards of conventional reinforcement learning methods in air combat environments, a curriculum learning distributed proximal policy optimization (CLDPPO) reinforcement learning algorithm is proposed. A reward function informed by professional empirical knowledge is integrated, a discrete action space is developed, and value and decision networks with separated global and local observations are established. A methodology is presented for unmanned aerial vehicles (UAVs) to acquire combat expertise through a sequence of fundamental courses whose offensive, defensive, and comprehensive content progressively intensifies. The experimental results show that the methodology surpasses the expert system and other mainstream reinforcement learning algorithms, enables the autonomous acquisition of air combat tactics, and mitigates the sparse-reward problem.
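The curriculum described in the abstract, a sequence of fundamental courses whose offensive, defensive, and comprehensive content progressively intensifies, can be sketched as a stage scheduler that advances once the agent's rolling mean episode reward clears a per-stage threshold. The stage names, thresholds, window size, and advancement rule below are illustrative assumptions for exposition, not the paper's actual CLDPPO implementation.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Stage:
    """One curriculum course (hypothetical names/thresholds)."""
    name: str
    reward_threshold: float  # rolling mean reward needed to advance


class CurriculumScheduler:
    """Advances to the next stage when the rolling mean episode
    reward over `window` episodes reaches the current threshold."""

    def __init__(self, stages, window=100):
        self.stages = stages
        self.idx = 0
        self.rewards = deque(maxlen=window)

    @property
    def current(self):
        return self.stages[self.idx]

    def record(self, episode_reward):
        """Log one episode's return; return the (possibly new) stage name."""
        self.rewards.append(episode_reward)
        window_full = len(self.rewards) == self.rewards.maxlen
        if window_full:
            mean = sum(self.rewards) / len(self.rewards)
            if (mean >= self.current.reward_threshold
                    and self.idx < len(self.stages) - 1):
                self.idx += 1        # graduate to the next course
                self.rewards.clear()  # fresh window for the new stage
        return self.current.name


# Illustrative three-course curriculum (offensive -> defensive -> comprehensive)
scheduler = CurriculumScheduler(
    [Stage("offensive", 0.5), Stage("defensive", 0.6), Stage("comprehensive", 0.7)],
    window=3,
)
```

In a full training loop, the returned stage name would select the environment configuration (opponent behavior, initial geometry, reward shaping) for the next episode batch.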

Key words: UAVs, air combat, sparse reward, curriculum learning, distributed proximal policy optimization (DPPO)
