Intelligent Decision-making Method in Imbalanced Air Combat Based on Asymmetric Self-play

Journal of System Simulation, 2026, Vol. 38, Issue (2): 433-446. doi: 10.16182/j.issn1004731x.joss.25-0621

Zheng Wei¹, Tang Jiahao¹, Xiong Xiaoping², Fan Xin¹

^1. School of Software Engineering, Nanchang Hangkong University, Nanchang 330063, China
^2. Civil Aviation Administration of China Jiangxi Aircraft Airworthiness Certification Center, Nanchang 330038, China

Received: 2025-06-30 Revised: 2025-09-19 Online: 2026-02-18 Published: 2026-02-11
Contact: Tang Jiahao
About the first author: Zheng Wei (born 1982), male, professor, master's supervisor, Ph.D.; research interests include artificial intelligence, big data analysis, and intelligent air combat decision-making.
Funding: National Natural Science Foundation of China (62467005)

Abstract

To address the strategy convergence caused by role homogenization when traditional self-play is applied to imbalanced air combat, an intelligent decision-making method based on asymmetric self-play is proposed. A hierarchical reinforcement learning framework decouples tactics from control, and differentiated reward functions are designed for the advantaged and disadvantaged sides. Bidirectional independent policy pools are constructed to promote the co-evolution of strategies, and the proximal policy optimization (PPO) algorithm is used to train the model. Experiments in 1v1 weapon-imbalanced and 2v1 numerically imbalanced scenarios show that, compared with symmetric self-play, the proposed method raises the kill rate of the advantaged side by up to 12% and the survival rate of the disadvantaged side by up to 40%; overall effectiveness in multi-aircraft combat also improves. These results support the effectiveness of the asymmetric design in enhancing the specialized combat capabilities and tactical diversity of decision-making agents for imbalanced air combat.

Key words: imbalanced air combat, asymmetric self-play, bidirectional policy pool, hierarchical reinforcement learning, proximal policy optimization
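
The idea of bidirectional independent policy pools can be illustrated with a minimal sketch. It assumes that each role keeps its own pool of frozen snapshots and that a learner in one role draws opponents only from the pool of the opposite role. The class `PolicyPool`, the function `sample_opponents`, the snapshot names, and the uniform sampling rule are all hypothetical and are not taken from the paper.

```python
import random
from dataclasses import dataclass, field


@dataclass
class PolicyPool:
    """Frozen policy snapshots for one role (hypothetical helper)."""
    role: str
    snapshots: list = field(default_factory=list)

    def add(self, snapshot_id: str) -> None:
        # Store an identifier for a saved policy checkpoint.
        self.snapshots.append(snapshot_id)

    def sample(self, k: int, rng: random.Random) -> list:
        # Assumption: uniform sampling with replacement.
        return [rng.choice(self.snapshots) for _ in range(k)]


def sample_opponents(learner_role: str, pools: dict, k: int,
                     rng: random.Random) -> list:
    """Draw k opponents for a learner from the pool of the opposite role."""
    opposite = {"advantaged": "disadvantaged", "disadvantaged": "advantaged"}
    return pools[opposite[learner_role]].sample(k, rng)


# Placeholder snapshot identifiers, for illustration only.
pools = {
    "advantaged": PolicyPool("advantaged", ["adv_v0", "adv_v1"]),
    "disadvantaged": PolicyPool("disadvantaged", ["dis_v0", "dis_v1", "dis_v2"]),
}
rng = random.Random(0)
opponents = sample_opponents("advantaged", pools, k=4, rng=rng)
```

Keeping the two pools separate means that an advantaged learner is never matched against a copy of itself, which is one plausible way to avoid the role homogenization described in the abstract.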

CLC number: TP391.9

Cite this article:

Zheng Wei, Tang Jiahao, Xiong Xiaoping, et al. Intelligent Decision-making Method in Imbalanced Air Combat Based on Asymmetric Self-play[J]. Journal of System Simulation, 2026, 38(2): 433-446.

Figures/Tables (17)

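Table 4 PPO algorithm parameter settings

Main parameter	Value
Discount factor	0.99
Learning rate	$\leq 3 \times 10^{-4}$
Clip ratio	0.2
Batch size	32
Episode length (steps)	1 000
Maximum steps	$1 \times 10^{8}$
Experience buffer size	3 000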
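
To show how the clip ratio enters training, the sketch below collects the table's values in a configuration object and evaluates the standard PPO clipped surrogate term for a single sample. The names `PPOConfig` and `clipped_surrogate` are hypothetical, the learning rate is read as an upper bound of 3 × 10⁻⁴, and the maximum step count is read as 1 × 10⁸; the paper's actual implementation may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PPOConfig:
    """Values from the PPO parameter table (field names are hypothetical)."""
    gamma: float = 0.99              # discount factor
    learning_rate_max: float = 3e-4  # upper bound on the learning rate
    clip_ratio: float = 0.2          # PPO clipping parameter
    batch_size: int = 32
    episode_steps: int = 1_000
    max_steps: int = 100_000_000     # read as 1 x 10^8
    buffer_size: int = 3_000


def clipped_surrogate(ratio: float, advantage: float, clip_ratio: float) -> float:
    """Standard PPO clipped objective for one sample.

    ratio is pi_new(a|s) / pi_old(a|s); the objective takes the more
    pessimistic of the unclipped and clipped terms.
    """
    clipped = min(max(ratio, 1.0 - clip_ratio), 1.0 + clip_ratio)
    return min(ratio * advantage, clipped * advantage)


cfg = PPOConfig()
# A ratio of 1.5 with a positive advantage is capped at 1 + 0.2 = 1.2.
capped = clipped_surrogate(1.5, 2.0, cfg.clip_ratio)
```

With a clip ratio of 0.2, a policy update that would raise the probability of a good action by more than 20% gains nothing further from the objective, which limits the size of each policy step.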



Engagement scenario	Outcome	Surviving aircraft on both sides
1v1 balanced	Win	{(1,0)}
	Loss	{(0,1)}
	Draw	{(1,1), (0,0)}
1v1 advantaged	Win	{(1,0)}
	Loss	{(1,1), (0,1), (0,0)}
1v1 disadvantaged	Win	{(1,1), (1,0), (0,0)}
	Loss	{(0,1)}
2v2 balanced	Win	{(2,1), (2,0), (1,0)}
	Loss	{(1,2), (0,2), (0,1)}
	Draw	{(2,2), (1,1), (0,0)}
2v1 advantaged	Win	{(2,0)}
	Loss	{(2,1), (1,0), (0,1)}
	Draw	{(1,1), (0,0)}
2v1 disadvantaged	Win	{(1,2), (1,0), (0,1)}
	Loss	{(0,2)}
	Draw	{(1,1), (0,0)}
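
The outcome rules in this table amount to a lookup from the surviving aircraft counts to a result. The sketch below encodes them that way. It assumes each pair is ordered as (own side, opponent), which is inferred from the 1v1 balanced rows rather than stated on this page; the English scenario names and the function `judge` are hypothetical.

```python
# Outcome rules copied from the table; pairs are assumed to be
# (own surviving aircraft, opponent surviving aircraft).
OUTCOMES = {
    "1v1 balanced": {"win": {(1, 0)}, "loss": {(0, 1)},
                     "draw": {(1, 1), (0, 0)}},
    "1v1 advantaged": {"win": {(1, 0)},
                       "loss": {(1, 1), (0, 1), (0, 0)}},
    "1v1 disadvantaged": {"win": {(1, 1), (1, 0), (0, 0)},
                          "loss": {(0, 1)}},
    "2v2 balanced": {"win": {(2, 1), (2, 0), (1, 0)},
                     "loss": {(1, 2), (0, 2), (0, 1)},
                     "draw": {(2, 2), (1, 1), (0, 0)}},
    "2v1 advantaged": {"win": {(2, 0)},
                       "loss": {(2, 1), (1, 0), (0, 1)},
                       "draw": {(1, 1), (0, 0)}},
    "2v1 disadvantaged": {"win": {(1, 2), (1, 0), (0, 1)},
                          "loss": {(0, 2)},
                          "draw": {(1, 1), (0, 0)}},
}


def judge(scenario: str, own: int, opponent: int) -> str:
    """Return 'win', 'loss' or 'draw' for the given survivor counts."""
    for result, states in OUTCOMES[scenario].items():
        if (own, opponent) in states:
            return result
    raise ValueError(f"no rule for {scenario} with survivors {(own, opponent)}")


# An unarmed 1v1 disadvantaged aircraft that merely survives counts as a win.
survival_result = judge("1v1 disadvantaged", 1, 1)
```

The asymmetry is visible in the data: in the 1v1 cases the advantaged side wins only by destroying its opponent, while the disadvantaged side wins by any outcome other than being shot down alone.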

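Parameter	Type/Value
UAV type	F-16 fighter
Missile type	AIM-9L missile
Minimum launch interval	25 s (125 simulation steps)
Maximum attack angle/(°)	45
Initial airspeed/(m/s)	240
Initial altitude/m	6 000
Physics step	1/60 s (60 Hz)
Simulation step/s	0.2
Maximum engagement duration/s	200 (1 000 simulation steps)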
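
The step counts quoted in this table follow from the 0.2 s simulation step, as the short check below shows. The constant and function names are hypothetical. The figure of 12 physics updates per simulation step is derived here from the two step sizes and is not stated in the table.

```python
PHYSICS_HZ = 60    # physics step of 1/60 s
SIM_STEP_S = 0.2   # one decision (simulation) step in seconds


def to_sim_steps(seconds: float) -> int:
    """Convert a duration in seconds to a whole number of simulation steps."""
    return round(seconds / SIM_STEP_S)


# Physics updates that fit in one simulation step: 0.2 s * 60 Hz.
physics_updates_per_sim_step = round(SIM_STEP_S * PHYSICS_HZ)

min_launch_interval_steps = to_sim_steps(25)   # minimum launch interval
max_engagement_steps = to_sim_steps(200)       # maximum engagement duration
```

The 1 000-step engagement limit computed here matches the episode length listed among the PPO settings.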

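Engagement scenario	Number of aircraft	Weapons (missiles per aircraft)
1v1 balanced	1	3
1v1 advantaged	1	2
1v1 disadvantaged	1	Unarmed
2v2 balanced	2	2
2v1 advantaged	2	2
2v1 disadvantaged	1	3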
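
These per-side configurations could be held in a small data structure, for example to compute the total missile load of each side. The sketch below is one possible encoding; `SideConfig`, `SIDES`, and the English scenario names are hypothetical, and the unarmed case is represented as zero missiles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SideConfig:
    """Force configuration for one side in one scenario (hypothetical type)."""
    aircraft: int
    missiles_per_aircraft: int

    @property
    def total_missiles(self) -> int:
        # Total missile load carried by this side.
        return self.aircraft * self.missiles_per_aircraft


# Values copied from the table; "Unarmed" is encoded as 0 missiles.
SIDES = {
    "1v1 balanced": SideConfig(1, 3),
    "1v1 advantaged": SideConfig(1, 2),
    "1v1 disadvantaged": SideConfig(1, 0),
    "2v2 balanced": SideConfig(2, 2),
    "2v1 advantaged": SideConfig(2, 2),
    "2v1 disadvantaged": SideConfig(1, 3),
}

adv_total = SIDES["2v1 advantaged"].total_missiles
dis_total = SIDES["2v1 disadvantaged"].total_missiles
```

In the 2v1 case the two sides differ in aircraft count while the single aircraft carries more missiles, so the totals are 4 against 3; in the 1v1 case the imbalance comes entirely from the weapon load.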


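Main parameter	Value
Initial ELO score	1 000
ELO rating adjustment coefficient	16
Save interval/rounds	3 000
Number of training opponents	4
Training opponent reset interval/rounds	3 000
Number of evaluation opponents	4
Evaluation opponent reset interval/rounds	3 000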
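
The initial score and adjustment coefficient above plug directly into the standard Elo update, sketched below. The logistic expected-score formula with a scale of 400 is the conventional Elo definition and is assumed here; the page does not state which variant the paper uses. The function names are hypothetical.

```python
INITIAL_ELO = 1000.0  # initial ELO score from the table
K_FACTOR = 16.0       # ELO rating adjustment coefficient from the table


def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B (conventional Elo, scale 400 assumed)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float) -> tuple:
    """Return updated (A, B) ratings; score_a is 1 for a win, 0.5 draw, 0 loss."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + K_FACTOR * (score_a - exp_a)
    new_b = rating_b + K_FACTOR * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two fresh policies: the winner gains 8 points and the loser drops 8.
winner, loser = update_elo(INITIAL_ELO, INITIAL_ELO, 1.0)
```

Because the update is zero-sum, the ratings of pool members stay centered on the initial value, so a snapshot's score can be read as its strength relative to the rest of the pool.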
