Reinforcement Learning Based Method for UAV Team Orienteering Optimization under Multi-constraint Condition

doi:10.16182/j.issn1004731x.joss.25-0595

Abstract

For the UAV team orienteering problem under multiple constraints, traditional optimization methods struggle with efficiency, while reinforcement learning approaches often yield low solution quality and incur high training costs. In response, this paper proposes an attention mechanism-based reinforcement learning method. A dynamic attention strategy network with multi-information fusion is designed to improve solution quality. A visibility-graph approach is employed to simplify the threat zone constraints and speed up convergence, and a decoding sequence reordering mechanism is introduced to further improve the solution. Simulation results show that the method generates high-quality solutions within milliseconds, achieving total rewards that approach or even surpass those obtained by traditional solvers such as OR-Tools and PyVRP, which take from several seconds to hundreds of seconds. Training efficiency also improves significantly, with the training time per epoch falling from several hours to about 30 minutes.

Key words: reinforcement learning, team orienteering problem, multi-UAV systems, attention mechanism

CLC Number: TP391.9

Yang Can, Chen Kai, Zhu Feng. Reinforcement Learning Based Method for UAV Team Orienteering Optimization under Multi-constraint Condition[J]. Journal of System Simulation, 2026, 38(2): 360-371.


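Parameter	Value
Number of attention layers	3
Feature dimension	128
Batch size	512
Total training samples	128 million
Samples per epoch	1.28 million
Epochs	100
Optimizer	Adam
Learning rate	0.0001
Strategy	Sample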
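
As a quick consistency check on the training settings in the table above, the following sketch recomputes the total sample count and the number of batches per epoch. The variable names are illustrative, and treating each batch as one gradient update is an assumption, not something the table states.

```python
# Values copied from the hyperparameter table above.
batch_size = 512
samples_per_epoch = 1_280_000   # 1.28 million
epochs = 100

# Total samples seen over the whole training run.
total_samples = samples_per_epoch * epochs

# Batches per epoch; equals the number of gradient updates per epoch
# only under the assumption of one update per batch.
batches_per_epoch = samples_per_epoch // batch_size

print(total_samples)      # 128000000
print(batches_per_epoch)  # 2500
```

The product matches the 128 million total listed in the table, and the per-epoch sample count divides evenly by the batch size.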

Scenario	Angle constraint	Number of threat zones	MIFDAM model	AM model	OR-Tools	PyVRP
50-3	Yes	3	15.49 (<1 s)	14.63 (<1 s)	13.99 (10 s)	N/A
	Yes	4	14.76 (<1 s)	13.80 (<1 s)	13.65 (10 s)	N/A
	Yes	5	14.17 (<1 s)	13.33 (<1 s)	12.21 (10 s)	N/A
	No	3	20.04 (<1 s)	19.48 (<1 s)	20.01 (1 s)	18.89 (1 s)
	No	4	19.51 (<1 s)	18.79 (<1 s)	19.92 (1 s)	18.73 (1 s)
	No	5	18.96 (<1 s)	18.03 (<1 s)	19.86 (1 s)	18.50 (1 s)
100-5	Yes	3	32.37 (<1 s)	30.17 (<1 s)	29.13 (200 s)	N/A
	Yes	4	31.20 (<1 s)	28.90 (<1 s)	29.08 (200 s)	N/A
	Yes	5	30.01 (<1 s)	28.62 (<1 s)	28.36 (200 s)	N/A
	No	3	39.75 (<1 s)	38.41 (<1 s)	38.57 (1 s)	33.16 (1 s)
	No	4	38.95 (<1 s)	37.76 (<1 s)	38.54 (1 s)	31.10 (1 s)
	No	5	38.27 (<1 s)	37.52 (<1 s)	38.48 (1 s)	31.09 (1 s)

Scenario	MIFDAM	After decoding sequence reordering
50-3	15.49	15.53
50-4	18.26	18.39
50-5	20.23	20.65
100-5	32.37	32.88
100-6	35.98	36.43
100-7	38.94	39.98
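
To make the effect of decoding sequence reordering easier to read, the sketch below computes the relative gain for each row of the table above. The helper `relative_gain` and the dictionary layout are illustrative choices, not part of the paper. The numbers are copied from the table.

```python
# (MIFDAM value, value after decoding sequence reordering) per scenario,
# copied from the table above.
rows = {
    "50-3": (15.49, 15.53),
    "50-4": (18.26, 18.39),
    "50-5": (20.23, 20.65),
    "100-5": (32.37, 32.88),
    "100-6": (35.98, 36.43),
    "100-7": (38.94, 39.98),
}


def relative_gain(before: float, after: float) -> float:
    """Percentage increase of `after` over `before`."""
    return (after - before) / before * 100.0


gains = {name: relative_gain(b, a) for name, (b, a) in rows.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}%")
```

Every scenario improves after reordering. The gain is smallest for 50-3 (about 0.3%) and largest for 100-7 (about 2.7%).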