Journal of System Simulation ›› 2026, Vol. 38 ›› Issue (3): 714-724. doi: 10.16182/j.issn1004731x.joss.25-0399

• Papers •

Robot Path Planning by Reinforcement Learning Based on SAC3Q-HDM

Li Dequan1,2, Xiong Wan1

  1. School of Artificial Intelligence, Anhui University of Science and Technology, Hefei 231131, China
  2. State Key Laboratory of Digital Intelligent Technology for Unmanned Coal Mining, Anhui University of Science and Technology, Huainan 232001, China
  • Received: 2025-05-09  Revised: 2025-07-25  Online: 2026-03-18  Published: 2026-03-27
  • First author: Li Dequan (b. 1973), male, professor, PhD; research interests: artificial intelligence and intelligent control technology.
  • Funding: National Key R&D Program of China (2023YFC3807500)


Abstract:

To address the problems of overestimation and underestimation bias, low sample utilization, and the inability to balance exploration and exploitation that reinforcement learning faces in path planning, an improved SAC method is proposed. An adaptive temperature coefficient adjusts the entropy magnitude to balance exploration and exploitation. On the basis of the SAC framework, a triple-Critic architecture is introduced, which dynamically weights and fuses the minimum and mean Q-values according to Q-value uncertainty, balancing overestimation against underestimation bias. A hybrid dynamic-sampling experience replay buffer is designed, which partitions experience data by a reward threshold and dynamically adjusts the sampling ratio, realizing progressive learning from core strategies to comprehensive generalization. A hierarchical heuristic reward function is designed to guide the robot in balancing the multi-objective demands of goal approaching and obstacle avoidance. Simulation results show that the improved algorithm has clear advantages in path length, planning time, and success rate, improving the efficiency and robustness of path planning.

Key words: reinforcement learning, path planning, SAC, hybrid dynamic sampling, hierarchical heuristic reward function
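The abstract describes fusing the minimum and mean of three Critics' Q-values with weights driven by Q-value uncertainty. A minimal sketch of one way such a fusion could look is below; the exact weighting formula is not given in the abstract, so using the critics' standard deviation as the uncertainty measure and an exponential weight `w = 1 - exp(-beta * std)` is an illustrative assumption, not the paper's formula.

```python
import numpy as np

def fused_target_q(q_values, beta=1.0):
    """Fuse three critics' estimates of Q(s', a') into one target value.

    The weight on the pessimistic minimum grows with the critics'
    disagreement (standard deviation), a stand-in for Q-value uncertainty:
    high disagreement pulls the target toward min (curbing overestimation),
    low disagreement pulls it toward the mean (curbing underestimation).
    """
    q = np.asarray(q_values, dtype=float)
    uncertainty = q.std()                  # disagreement across the three critics
    w = 1.0 - np.exp(-beta * uncertainty)  # in [0, 1): more uncertain -> closer to min
    return w * q.min() + (1.0 - w) * q.mean()
```

When all critics agree, the fused target reduces to their common value; as disagreement grows, it slides smoothly from the mean toward the minimum.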

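The hybrid dynamic-sampling replay buffer partitions experience by a reward threshold and shifts the sampling ratio over training. The sketch below illustrates this idea under stated assumptions: the two-pool split, the linear decay schedule, and all constants (`start_ratio`, `end_ratio`, `decay_steps`) are illustrative choices, not values from the paper.

```python
import random
from collections import deque

class HybridDynamicReplayBuffer:
    """Two-tier replay buffer: transitions whose reward meets a threshold go
    to a 'core' pool, the rest to a 'general' pool. The fraction of each batch
    drawn from the core pool decays linearly over training, moving from
    concentrated learning of high-reward (core) experience toward broader
    generalization over all experience."""

    def __init__(self, capacity, reward_threshold,
                 start_ratio=0.8, end_ratio=0.2, decay_steps=10_000):
        self.core = deque(maxlen=capacity // 2)
        self.general = deque(maxlen=capacity // 2)
        self.threshold = reward_threshold
        self.start_ratio, self.end_ratio = start_ratio, end_ratio
        self.decay_steps = decay_steps
        self.step = 0

    def add(self, transition, reward):
        # Partition by reward threshold at insertion time.
        pool = self.core if reward >= self.threshold else self.general
        pool.append(transition)

    def core_ratio(self):
        # Linear decay of the core-pool share (illustrative schedule).
        t = min(self.step / self.decay_steps, 1.0)
        return self.start_ratio + t * (self.end_ratio - self.start_ratio)

    def sample(self, batch_size):
        self.step += 1
        n_core = min(int(batch_size * self.core_ratio()), len(self.core))
        n_gen = min(batch_size - n_core, len(self.general))
        return (random.sample(list(self.core), n_core)
                + random.sample(list(self.general), n_gen))
```

Early in training most of each batch comes from high-reward transitions; late in training the mix inverts, which is one way to realize the "core strategy to comprehensive generalization" progression the abstract describes.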

CLC number:
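A hierarchical heuristic reward of the kind the abstract outlines typically layers sparse terminal signals over dense shaping terms. The sketch below is one plausible three-layer structure; the layering and every constant (`goal_bonus`, `safe_margin`, the scales) are assumptions for illustration, not the paper's actual function.

```python
def hierarchical_reward(dist_to_goal, prev_dist_to_goal, dist_to_obstacle,
                        reached=False, collided=False,
                        goal_bonus=100.0, collision_penalty=-100.0,
                        progress_scale=10.0, safe_margin=0.5, obstacle_scale=5.0):
    """Layered heuristic reward (illustrative constants):
    1) terminal layer: large bonus on reaching the goal, large penalty on collision;
    2) progress layer: dense reward proportional to the reduction in goal distance;
    3) safety layer: penalty growing as the robot intrudes into the obstacle margin.
    """
    if reached:
        return goal_bonus
    if collided:
        return collision_penalty
    # Dense shaping: positive when the step moved the robot closer to the goal.
    r = progress_scale * (prev_dist_to_goal - dist_to_goal)
    # Graded obstacle-avoidance penalty inside the safety margin.
    if dist_to_obstacle < safe_margin:
        r -= obstacle_scale * (safe_margin - dist_to_obstacle)
    return r
```

The terminal layer dominates when it fires, while the two dense layers trade off goal approach against obstacle clearance on every other step, matching the multi-objective balance described in the abstract.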