Journal of System Simulation ›› 2026, Vol. 38 ›› Issue (3): 714-724. doi: 10.16182/j.issn1004731x.joss.25-0399

• Papers •

Robot Path Planning by Reinforcement Learning Based on SAC3Q-HDM

Li Dequan1,2, Xiong Wan1

  1. School of Artificial Intelligence, Anhui University of Science and Technology, Hefei 231131, China
  2. State Key Laboratory of Digital Intelligent Technology for Unmanned Coal Mining, Anhui University of Science and Technology, Huainan 232001, China
  • Received: 2025-05-09  Revised: 2025-07-25  Online: 2026-03-18  Published: 2026-03-27
  • First author: Li Dequan (b. 1973), male, professor, PhD; research interests: artificial intelligence and intelligent control technology.
  • Funding: National Key R&D Program of China (2023YFC3807500)


Abstract:

To address the problems of overestimation and underestimation bias, low sample utilization, and the inability to balance exploration and exploitation that reinforcement learning faces in path planning, an improved SAC method is proposed. An adaptive temperature coefficient adjusts the entropy magnitude to balance exploration and exploitation. On the basis of the SAC framework, a triple-Critic architecture is introduced, which dynamically weights and fuses the minimum and mean Q-values according to Q-value uncertainty, balancing overestimation against underestimation bias. A hybrid dynamic-sampling experience replay buffer is designed, which partitions experience data by a reward threshold and dynamically adjusts the sampling ratio, realizing progressive learning from core strategies to comprehensive generalization. A hierarchical heuristic reward function is designed to guide the robot in balancing the multi-objective demands of goal approaching and obstacle avoidance. Simulation results show that the improved algorithm has clear advantages in path length, planning time, and success rate, improving the efficiency and robustness of path planning.

Key words: reinforcement learning, path planning, SAC, hybrid dynamic sampling, hierarchical heuristic reward function
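The abstract describes fusing the minimum and mean of three Critics' Q-values with weights driven by Q-value uncertainty. A minimal sketch of one way such a fusion could look is below; the exact weighting formula is not given in the abstract, so using the critics' standard deviation as the uncertainty measure and an exponential weight `w = 1 - exp(-beta * std)` is an illustrative assumption, not the paper's formula.

```python
import numpy as np

def fused_target_q(q_values, beta=1.0):
    """Fuse three critics' estimates of Q(s', a') into one target value.

    The weight on the pessimistic minimum grows with the critics'
    disagreement (standard deviation), a stand-in for Q-value uncertainty:
    high disagreement pulls the target toward min (curbing overestimation),
    low disagreement pulls it toward the mean (curbing underestimation).
    """
    q = np.asarray(q_values, dtype=float)
    uncertainty = q.std()                  # disagreement across the three critics
    w = 1.0 - np.exp(-beta * uncertainty)  # in [0, 1): more uncertain -> closer to min
    return w * q.min() + (1.0 - w) * q.mean()
```

When all critics agree, the fused target reduces to their common value; as disagreement grows, it slides smoothly from the mean toward the minimum.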

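The hybrid dynamic-sampling replay buffer partitions experience by a reward threshold and shifts the sampling ratio over training. The sketch below illustrates this idea under stated assumptions: the two-pool split, the linear decay schedule, and all constants (`start_ratio`, `end_ratio`, `decay_steps`) are illustrative choices, not values from the paper.

```python
import random
from collections import deque

class HybridDynamicReplayBuffer:
    """Two-tier replay buffer: transitions whose reward meets a threshold go
    to a 'core' pool, the rest to a 'general' pool. The fraction of each batch
    drawn from the core pool decays linearly over training, moving from
    concentrated learning of high-reward (core) experience toward broader
    generalization over all experience."""

    def __init__(self, capacity, reward_threshold,
                 start_ratio=0.8, end_ratio=0.2, decay_steps=10_000):
        self.core = deque(maxlen=capacity // 2)
        self.general = deque(maxlen=capacity // 2)
        self.threshold = reward_threshold
        self.start_ratio, self.end_ratio = start_ratio, end_ratio
        self.decay_steps = decay_steps
        self.step = 0

    def add(self, transition, reward):
        # Partition by reward threshold at insertion time.
        pool = self.core if reward >= self.threshold else self.general
        pool.append(transition)

    def core_ratio(self):
        # Linear decay of the core-pool share (illustrative schedule).
        t = min(self.step / self.decay_steps, 1.0)
        return self.start_ratio + t * (self.end_ratio - self.start_ratio)

    def sample(self, batch_size):
        self.step += 1
        n_core = min(int(batch_size * self.core_ratio()), len(self.core))
        n_gen = min(batch_size - n_core, len(self.general))
        return (random.sample(list(self.core), n_core)
                + random.sample(list(self.general), n_gen))
```

Early in training most of each batch comes from high-reward transitions; late in training the mix inverts, which is one way to realize the "core strategy to comprehensive generalization" progression the abstract describes.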

CLC number:
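A hierarchical heuristic reward of the kind the abstract outlines typically layers sparse terminal signals over dense shaping terms. The sketch below is one plausible three-layer structure; the layering and every constant (`goal_bonus`, `safe_margin`, the scales) are assumptions for illustration, not the paper's actual function.

```python
def hierarchical_reward(dist_to_goal, prev_dist_to_goal, dist_to_obstacle,
                        reached=False, collided=False,
                        goal_bonus=100.0, collision_penalty=-100.0,
                        progress_scale=10.0, safe_margin=0.5, obstacle_scale=5.0):
    """Layered heuristic reward (illustrative constants):
    1) terminal layer: large bonus on reaching the goal, large penalty on collision;
    2) progress layer: dense reward proportional to the reduction in goal distance;
    3) safety layer: penalty growing as the robot intrudes into the obstacle margin.
    """
    if reached:
        return goal_bonus
    if collided:
        return collision_penalty
    # Dense shaping: positive when the step moved the robot closer to the goal.
    r = progress_scale * (prev_dist_to_goal - dist_to_goal)
    # Graded obstacle-avoidance penalty inside the safety margin.
    if dist_to_obstacle < safe_margin:
        r -= obstacle_scale * (safe_margin - dist_to_obstacle)
    return r
```

The terminal layer dominates when it fires, while the two dense layers trade off goal approach against obstacle clearance on every other step, matching the multi-objective balance described in the abstract.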