Reinforcement Learning Based Method for UAV Team Orienteering Optimization under Multi-constraint Condition

doi:10.16182/j.issn1004731x.joss.25-0595

Abstract

In complex multi-constraint scenarios, traditional optimization methods struggle to solve the problem efficiently, while existing reinforcement learning approaches suffer from low solution quality and low training efficiency. To address both issues, this paper proposes an efficient attention-based reinforcement learning method. A dynamic attention policy network with multi-information fusion is designed to improve solution quality. A visibility-graph approach is used to simplify the threat-zone constraints, which speeds up training convergence, and a sequence reordering mechanism is introduced in the decoding stage to further improve the solution. Simulation results show that the method generates high-quality solutions within milliseconds, with total rewards that approach or even surpass those obtained by traditional solvers such as OR-Tools and PyVRP in several seconds to hundreds of seconds. Training efficiency also improves substantially: the training time per epoch drops from several hours to about 30 minutes.

Key words: reinforcement learning, team orienteering problem, multi-UAV systems, attention mechanism
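The visibility-graph simplification of threat-zone constraints mentioned in the abstract can be illustrated with a minimal sketch. It assumes circular threat zones, approximates each zone by a circumscribed regular polygon, and runs Dijkstra over mutually visible points to obtain a threat-avoiding distance between two nodes. The circular zone shape, the polygon approximation, and the names `seg_point_dist`, `visible`, and `safe_distance` are illustrative assumptions, not details taken from the paper.

```python
import heapq
import math


def seg_point_dist(a, b, p):
    """Shortest distance from point p to the segment ab."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    denom = dx * dx + dy * dy
    if denom == 0.0:
        return math.hypot(px - ax, py - ay)
    # Clamp the projection parameter so the closest point stays on the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / denom))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def visible(a, b, zones, eps=1e-9):
    """True if segment ab stays outside every circular zone (cx, cy, r)."""
    return all(seg_point_dist(a, b, (cx, cy)) >= r - eps for cx, cy, r in zones)


def safe_distance(start, goal, zones, k=8):
    """Length of the shortest threat-avoiding path on the visibility graph."""
    pts = [start, goal]
    for cx, cy, r in zones:
        # Circumscribed k-gon: its edges are tangent to the circle (assumption).
        rho = r / math.cos(math.pi / k)
        pts += [(cx + rho * math.cos(2 * math.pi * i / k),
                 cy + rho * math.sin(2 * math.pi * i / k)) for i in range(k)]
    dist = [math.inf] * len(pts)
    dist[0] = 0.0
    heap = [(0.0, 0)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry
        if u == 1:
            return d  # reached the goal
        for v in range(len(pts)):
            if v != u and visible(pts[u], pts[v], zones):
                nd = d + math.dist(pts[u], pts[v])
                if nd < dist[v]:
                    dist[v] = nd
                    heapq.heappush(heap, (nd, v))
    return math.inf
```

With no threat zones, `safe_distance((0.0, 0.0), (10.0, 0.0), [])` reduces to the Euclidean distance; with a zone of radius 2 centred at (5, 0) on the direct line, the returned length is slightly above 10 because the path detours around the zone. Precomputing such distances once per instance is one plausible way a policy could treat threat zones as modified travel costs rather than as constraints checked at every decoding step.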

CLC number:

TP391.9

Yang Can, Chen Kai, Zhu Feng. Reinforcement Learning Based Method for UAV Team Orienteering Optimization under Multi-constraint Condition[J]. Journal of System Simulation, 2026, 38(2): 360-371.

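Parameter	Value
Number of attention layers	3
Feature dimension	128
Batch size	512
Total training samples	128 million
Samples per epoch	1.28 million
Epochs	100
Optimizer	Adam
Learning rate	0.0001
Strategy	Sample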
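The hyperparameters above are linked by simple arithmetic: 1.28 million samples per epoch over 100 epochs gives 128 million training samples, and 1.28 million samples at a batch size of 512 gives 2,500 batches per epoch. The sketch below encodes the table as a configuration object and checks those derived values. It also illustrates one reading of the "Sample" strategy, namely drawing the next node from a masked softmax instead of taking the argmax. The names `TrainConfig`, `masked_softmax`, and `sample_next_node` are hypothetical, and the interpretation of "Sample" as sampling-based node selection is an assumption.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    attention_layers: int = 3
    feature_dim: int = 128
    batch_size: int = 512
    samples_per_epoch: int = 1_280_000
    epochs: int = 100
    learning_rate: float = 1e-4  # used with the Adam optimizer

    @property
    def total_samples(self):
        return self.samples_per_epoch * self.epochs

    @property
    def batches_per_epoch(self):
        return self.samples_per_epoch // self.batch_size


def masked_softmax(logits, feasible):
    """Softmax over feasible entries only; infeasible nodes get probability 0.

    Assumes at least one node is feasible.
    """
    top = max(l for l, ok in zip(logits, feasible) if ok)
    exps = [math.exp(l - top) if ok else 0.0 for l, ok in zip(logits, feasible)]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_node(logits, feasible, rng):
    """Draw the next node index from the masked distribution."""
    probs = masked_softmax(logits, feasible)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


cfg = TrainConfig()
print(cfg.total_samples, cfg.batches_per_epoch)
```

Masking before the softmax keeps infeasible nodes (for example, already visited ones) at exactly zero probability, so sampling can never select them.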

Scenario	Angle constraint	Number of threat zones	MIFDAM model	AM model	OR-Tools	PyVRP
50-3	Yes	3	15.49 (<1 s)	14.63 (<1 s)	13.99 (10 s)	N/A
50-3	Yes	4	14.76 (<1 s)	13.80 (<1 s)	13.65 (10 s)	N/A
50-3	Yes	5	14.17 (<1 s)	13.33 (<1 s)	12.21 (10 s)	N/A
50-3	No	3	20.04 (<1 s)	19.48 (<1 s)	20.01 (1 s)	18.89 (1 s)
50-3	No	4	19.51 (<1 s)	18.79 (<1 s)	19.92 (1 s)	18.73 (1 s)
50-3	No	5	18.96 (<1 s)	18.03 (<1 s)	19.86 (1 s)	18.50 (1 s)
100-5	Yes	3	32.37 (<1 s)	30.17 (<1 s)	29.13 (200 s)	N/A
100-5	Yes	4	31.20 (<1 s)	28.90 (<1 s)	29.08 (200 s)	N/A
100-5	Yes	5	30.01 (<1 s)	28.62 (<1 s)	28.36 (200 s)	N/A
100-5	No	3	39.75 (<1 s)	38.41 (<1 s)	38.57 (1 s)	33.16 (1 s)
100-5	No	4	38.95 (<1 s)	37.76 (<1 s)	38.54 (1 s)	31.10 (1 s)
100-5	No	5	38.27 (<1 s)	37.52 (<1 s)	38.48 (1 s)	31.09 (1 s)
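Reading each cell of the table above as total reward followed by solve time, as the abstract suggests, the comparison can be expressed as a relative gain over each baseline. The short sketch below computes that gain for a few rows copied from the table; the helper name `relative_gain` and the choice of rows are illustrative.

```python
def relative_gain(ours, baseline):
    """Percentage improvement of `ours` over `baseline` (higher reward is better)."""
    return 100.0 * (ours - baseline) / baseline


# (scenario, angle constraint, threat zones): (MIFDAM, AM, OR-Tools) total rewards
rows = {
    ("50-3", True, 3): (15.49, 14.63, 13.99),
    ("50-3", False, 5): (18.96, 18.03, 19.86),
    ("100-5", True, 3): (32.37, 30.17, 29.13),
    ("100-5", False, 3): (39.75, 38.41, 38.57),
}

for key, (mifdam, am, ortools) in rows.items():
    print(key,
          f"vs AM {relative_gain(mifdam, am):+.2f}%",
          f"vs OR-Tools {relative_gain(mifdam, ortools):+.2f}%")
```

On the angle-constrained 50-3 case with three threat zones, this gives roughly +5.9% over AM and +10.7% over OR-Tools. On the unconstrained 50-3 case with five zones, MIFDAM is about 4.5% below the OR-Tools result, which is consistent with the abstract's wording that the rewards "approach or even surpass" those of the solvers.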

Scenario	MIFDAM	After decoding-order reordering
50-3	15.49	15.53
50-4	18.26	18.39
50-5	20.23	20.65
100-5	32.37	32.88
100-6	35.98	36.43
100-7	38.94	39.98
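The effect of the reordering step can be summarised by the per-scenario difference between the two columns above. The sketch below copies the six rows and computes absolute and relative gains; the variable names are illustrative, and the code does not model the reordering mechanism itself.

```python
before = {"50-3": 15.49, "50-4": 18.26, "50-5": 20.23,
          "100-5": 32.37, "100-6": 35.98, "100-7": 38.94}
after = {"50-3": 15.53, "50-4": 18.39, "50-5": 20.65,
         "100-5": 32.88, "100-6": 36.43, "100-7": 39.98}

# Absolute and relative gain in total reward for each scenario.
gains = {k: after[k] - before[k] for k in before}
rel_gains = {k: 100.0 * gains[k] / before[k] for k in before}
mean_gain = sum(gains.values()) / len(gains)
best = max(gains, key=gains.get)

for k in before:
    print(f"{k}: +{gains[k]:.2f} ({rel_gains[k]:.2f}%)")
print(f"mean gain: {mean_gain:.2f}, largest in scenario {best}")
```

The gain is positive in all six scenarios, ranging from 0.04 (50-3) to 1.04 (100-7), with a mean of about 0.43.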