Journal of System Simulation ›› 2025, Vol. 37 ›› Issue (5): 1169-1187. DOI: 10.16182/j.issn1004731x.joss.24-0025

• Research Paper •

A Quadrotor Trajectory Tracking Control Method Based on Deep Reinforcement Learning

Wu Guohua1, Zeng Jiaheng2, Wang Dezhi3, Zheng Long4, Zou Wei5

  1. School of Automation, Central South University, Changsha 410083, China
    2. School of Traffic and Transportation Engineering, Central South University, Changsha 410083, China
    3. School of Meteorology and Oceanography, National University of Defense Technology, Changsha 410015, China
    4. Military Vocational Education Technology Service Center, National University of Defense Technology, Changsha 410015, China
    5. School of Computer Science and Engineering, Central South University, Changsha 410083, China
  • Received: 2024-01-08  Revised: 2024-03-12  Online: 2025-05-20  Published: 2025-05-23
  • Corresponding author: Wang Dezhi
  • First author: Wu Guohua (1986-), male, professor, Ph.D.; research interests include UAV systems and reinforcement learning algorithm design.
  • Funding: National Natural Science Foundation of China (62373380)

Abstract:

Constrained by the fixed structure determined by their model equations, traditional quadrotor controller designs struggle to cope with the control errors caused by variations in model parameters and environmental disturbances. This paper proposes a quadrotor trajectory tracking control method based on deep reinforcement learning: it formulates the corresponding Markov decision process model and, on top of the PPO framework, develops the PPO-SAG (PPO with self-adaptive guide) algorithm. PPO-SAG adds a self-adaptive mechanism to the learning process that draws on PID expert knowledge for guidance, improving training convergence and stability. Tailored to the characteristics of the problem, an objective function with a distance-constraint penalty and an entropy strategy is designed, and a disturbance-error information supplement structure and a trajectory feature selection structure are proposed to supply control-error information and extract the key elements of the future trajectory, further improving convergence. In addition, dynamic state normalization, batch normalization of the advantage function, and reward scaling are employed to handle state representation and reward-advantage expression in three-dimensional space more reasonably. Experiments on single and mixed trajectories show that the proposed PPO-SAG algorithm achieves the best convergence and stability, and ablation experiments confirm that each proposed mechanism and structure contributes positively. The studied problem, deep-reinforcement-learning-based quadrotor trajectory tracking control under unknown disturbances, provides a path toward more robust and efficient quadrotor controllers.
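
To make the guided-learning idea concrete, the following is a minimal sketch, assuming a PyTorch Gaussian policy over rotor commands, of how a PPO objective could combine the clipped surrogate loss, an entropy bonus, a distance-constraint penalty, and a self-adaptive term that pulls the policy toward a PID expert. The weight schedule, the 1 m tolerance, and all names are illustrative assumptions, not the paper's implementation.

```python
import torch

def ppo_sag_loss(policy, states, actions, old_log_probs, advantages,
                 pid_actions, track_error,
                 clip_eps=0.2, ent_coef=0.01, dist_coef=0.5, guide_coef=1.0):
    dist = policy(states)  # assumed: returns a torch.distributions.Normal over rotor commands
    log_probs = dist.log_prob(actions).sum(-1)
    ratio = torch.exp(log_probs - old_log_probs)

    # Standard PPO clipped surrogate objective.
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages)

    # Entropy bonus keeps exploration alive.
    entropy = dist.entropy().sum(-1)

    # Distance-constraint penalty: punish straying beyond an assumed 1 m tube
    # around the reference trajectory.
    dist_penalty = torch.relu(track_error - 1.0).pow(2)

    # Self-adaptive guide term: pull the policy mean toward the PID expert's
    # action, with a weight that grows with tracking error and fades as the
    # learned policy improves (assumed schedule).
    guide_w = guide_coef * torch.tanh(track_error.mean()).detach()
    guide = (dist.mean - pid_actions).pow(2).sum(-1)

    return (-surrogate - ent_coef * entropy
            + dist_coef * dist_penalty).mean() + guide_w * guide.mean()
```

Guidance that decays with tracking performance lets the PID prior stabilize early training without capping what the learned policy can ultimately achieve.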
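The trajectory feature selection structure can likewise be sketched as an attention block in which the current UAV state queries the next K reference waypoints, so the policy can weight the track segments that matter most; the dimensions and names below are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class TrajectoryFeatureSelector(nn.Module):
    """Attend over the next K reference waypoints with the UAV state as query."""
    def __init__(self, state_dim=12, wp_dim=3, d_model=64, n_heads=4):
        super().__init__()
        self.state_proj = nn.Linear(state_dim, d_model)  # query: current state
        self.wp_proj = nn.Linear(wp_dim, d_model)        # keys/values: waypoints
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, state, waypoints):
        # state: (B, state_dim); waypoints: (B, K, wp_dim) future track points
        q = self.state_proj(state).unsqueeze(1)          # (B, 1, d_model)
        kv = self.wp_proj(waypoints)                     # (B, K, d_model)
        ctx, weights = self.attn(q, kv, kv)              # weight the segments ahead
        return ctx.squeeze(1), weights                   # (B, d_model) track summary

# Example: summarize the next 10 waypoints for a batch of 8 states.
feat, attn_w = TrajectoryFeatureSelector()(torch.randn(8, 12), torch.randn(8, 10, 3))
```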

Key words: deep reinforcement learning, quadrotor trajectory tracking control, proximal policy optimization (PPO), adaptive mechanism, attention mechanism

CLC number: