Strike Strategy Planning Method of Unmanned Ground Vehicles Based on Improved PPO Algorithm

doi:10.16182/j.issn1004731x.joss.25-0486

Abstract

Predefined strike rules cannot maximize the hit rate of unmanned ground vehicles (UGVs), and continuous motion planning is difficult to optimize jointly with discrete strike decision-making. To address both problems, an improved proximal policy optimization (PPO) algorithm based on a hybrid action space and a gated recurrent unit (GRU) is proposed. An environment model and a target model are built for the UGV strike mission, together with a three-layer UGV model that integrates kinematic constraints, situational awareness, and dynamic decision-making. Two separate policy networks are used: a continuous motion planning network for path planning, and a discrete strike decision-making network that handles the selection of the strike position and the target sequence. A GRU is introduced because the state is only partially observable during decision-making, so the current state has to be inferred from historical observations. Simulation results show that the method jointly optimizes UGV path planning and strike decision-making and improves the ability of UGVs to perform strike missions autonomously.

Key words: deep reinforcement learning (DRL), unmanned ground vehicle, path planning, strike decision-making, proximal policy optimization (PPO)
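The architecture described in the abstract can be illustrated with a minimal sketch: a recurrent memory summarizes past observations, and two policy heads read that memory to produce a continuous motion command and a discrete strike choice. Everything concrete here is an assumption made for illustration and is not taken from the paper: the scalar hidden state, the function names (`gru_step`, `sample_hybrid_action`), the weights, the 2-D motion command, the 3-way strike choice, and the choice to let both heads share one memory and to add their log-probabilities (which assumes the two heads are independent given the memory).

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(hidden, obs, w):
    """Fold one observation into a scalar memory using GRU-style gating."""
    x = sum(obs) / len(obs)                               # toy input projection (assumption)
    z = sigmoid(w["wz"] * x + w["uz"] * hidden)           # update gate
    r = sigmoid(w["wr"] * x + w["ur"] * hidden)           # reset gate
    cand = math.tanh(w["wn"] * x + w["un"] * r * hidden)  # candidate memory
    return (1.0 - z) * hidden + z * cand                  # blend old memory and candidate


def gaussian_log_prob(action, mean, std):
    """Log-density of a diagonal Gaussian evaluated at `action`."""
    return sum(
        -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        for a, m, s in zip(action, mean, std)
    )


def sample_hybrid_action(hidden, rng):
    """Sample (motion command, strike choice, joint log-prob) from two heads."""
    # Continuous head: hypothetical 2-D motion command (e.g. speed, steering).
    mean, std = [0.8 * hidden, -0.3 * hidden], [0.5, 0.5]
    motion = [rng.gauss(m, s) for m, s in zip(mean, std)]
    # Discrete head: hypothetical 3-way strike choice via softmax over logits.
    logits = [hidden, 0.0, -hidden]
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    strike = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    # Assumption: heads are independent given the memory, so log-probs add.
    log_prob = gaussian_log_prob(motion, mean, std) + math.log(probs[strike])
    return motion, strike, log_prob


# Made-up weights and observation history, for illustration only.
weights = {"wz": 0.5, "uz": 0.1, "wr": 0.4, "ur": 0.2, "wn": 1.0, "un": 0.7}
rng = random.Random(0)
hidden = 0.0
for obs in ([0.2, 0.1], [0.5, -0.3], [0.9, 0.4]):
    hidden = gru_step(hidden, obs, weights)
motion, strike, log_prob = sample_hybrid_action(hidden, rng)
print(motion, strike, log_prob)
```

A joint log-probability of this kind is what a PPO-style update would need in order to form the probability ratio for a hybrid action; a real implementation would use vector-valued hidden states and learned network weights.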

CLC number: TP391.9

Wang Bingkun, Wang Yue, Yang Mei, et al. Strike Strategy Planning Method of Unmanned Ground Vehicles Based on Improved PPO Algorithm[J]. Journal of System Simulation, 2026, 38(2): 372-386.

Figures/Tables (16): Figures 1-12; Tables 1-4.


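Parameter	Value	Parameter	Value
Clip ratio	0.2	Training epochs	200
Update frequency per round	20	Interactions per episode	100
Discount factor	0.99	Entropy coefficient	0.02
Learning rate	0.0001	Batch size	32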
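To show where two of the parameters above enter the update, the following is a minimal sketch of the standard PPO clipped surrogate loss with an entropy bonus. It assumes the listed clip ratio (0.2) is the PPO clipping parameter and that the listed entropy value (0.02) is the entropy bonus coefficient; the function name `ppo_clip_loss` and its per-sample list inputs are hypothetical, and the value-function loss is omitted. It is not the authors' implementation.

```python
import math

CLIP_RATIO = 0.2     # clip ratio from the parameter table
ENTROPY_COEF = 0.02  # entropy value from the table, assumed to be the bonus coefficient


def ppo_clip_loss(new_log_probs, old_log_probs, advantages, entropies):
    """Mean of the negated clipped surrogate minus the entropy bonus."""
    total = 0.0
    for new_lp, old_lp, adv, ent in zip(
        new_log_probs, old_log_probs, advantages, entropies
    ):
        ratio = math.exp(new_lp - old_lp)  # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - CLIP_RATIO), 1.0 + CLIP_RATIO)
        # Pessimistic bound: take the smaller of the two surrogate terms.
        surrogate = min(ratio * adv, clipped * adv)
        total += -surrogate - ENTROPY_COEF * ent
    return total / len(advantages)


# With a positive advantage, a ratio above 1.2 earns no extra credit.
print(ppo_clip_loss([1.0], [0.0], [1.0], [0.0]))
```

For a hybrid action, `new_log_probs` and `old_log_probs` could hold joint log-probabilities of the continuous and discrete parts, although the source does not state whether the two policy networks are updated with a joint or with separate ratios.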

Metric	Hybrid-PPO	GRU-PPO	Improved PPO
Mean reward	2.133	2.103	3.105
Mean reward (convergence phase)	2.105	2.108	3.217
Standard deviation	0.539	0.277	0.363
Standard deviation (convergence phase)	0.669	0.251	0.169
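As a reading aid for the reward figures above, the relative differences between the improved PPO and the two baselines can be computed directly. The sketch below only reuses values from the table; the helper name `relative_change` is hypothetical. It gives a mean reward roughly 46% higher than Hybrid-PPO and roughly 48% higher than GRU-PPO, and a convergence-phase standard deviation roughly 75% and 33% lower, respectively.

```python
def relative_change(new, baseline):
    """Return (new - baseline) / baseline as a fraction."""
    return (new - baseline) / baseline


# Values copied from the reward table above.
mean_reward = {"Hybrid-PPO": 2.133, "GRU-PPO": 2.103, "Improved PPO": 3.105}
converged_std = {"Hybrid-PPO": 0.669, "GRU-PPO": 0.251, "Improved PPO": 0.169}

for name in ("Hybrid-PPO", "GRU-PPO"):
    gain = relative_change(mean_reward["Improved PPO"], mean_reward[name])
    std_change = relative_change(converged_std["Improved PPO"], converged_std[name])
    print(f"vs {name}: mean reward {gain:+.1%}, convergence-phase std {std_change:+.1%}")
```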

Metric	Hybrid-PPO	GRU-PPO	Improved PPO
Mean loss	0.726	0.330	0.178
Mean loss (convergence phase)	0.915	0.197	0.084
Standard deviation	3.579	0.692	0.293
Standard deviation (convergence phase)	6.027	0.281	0.089

Metric	Hybrid-PPO	GRU-PPO	Improved PPO
Mean hit rate	0.5025	0.6825	0.6675
Mean threat level	0.3267	0.3634	0.3967
Travel distance (km)	0.6920	2.9590	1.9450