Multi-step Information Aided Q-learning Path Planning Algorithm

doi:10.16182/j.issn1004731x.joss.23-0543

Abstract

To improve the path planning capability of mobile robots in static environments and to address the slow convergence of the traditional Q-learning algorithm in path planning, this paper proposes an improved Q-learning algorithm aided by multi-step information. First, the eligibility traces are updated using the multi-step information of the greedy actions under the ε-greedy strategy together with the length of the historical optimal path, so that effective eligibility traces keep contributing throughout the iterations of the algorithm; the preserved multi-step information is also used to escape loop traps that the robot may fall into. Second, a local multi-flower pollination algorithm is used to initialize the Q-value table, which improves the robot's early search efficiency. Third, in line with the robot's goals at different exploration stages, an action selection strategy is designed that combines the standard deviation of the iterative path lengths with the number of times the robot has successfully reached the target point, improving the algorithm's balance between exploration and exploitation of environmental information. Experimental results show that the proposed algorithm converges quickly, which supports its feasibility and effectiveness.
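
For orientation, the sketch below shows a standard Watkins-style Q(λ) update, the kind of eligibility-trace baseline that multi-step Q-learning variants build on. It is not the update rule proposed in the paper: the abstract does not give the formulas that fold greedy-action multi-step information and the historical optimal path length into the traces, so none of that is reproduced here. The function name `q_lambda_update`, the flag `next_is_greedy`, and the trace-decay value `lam=0.8` are illustrative assumptions; the defaults for `alpha` and `gamma` follow the learning rate and discount factor listed in the parameter table of this document.

```python
import numpy as np


def q_lambda_update(Q, E, s, a, r, s_next, next_is_greedy,
                    alpha=0.4, gamma=0.95, lam=0.8):
    """Apply one Watkins-style Q(lambda) update in place and return the TD error.

    Q and E are arrays of shape (n_states, n_actions); E holds the traces.
    lam is a hypothetical value chosen only for illustration.
    """
    delta = r + gamma * Q[s_next].max() - Q[s, a]  # one-step TD error
    E[s, a] += 1.0                                 # accumulating trace for the visited pair
    Q += alpha * delta * E                         # credit every pair with a nonzero trace
    if next_is_greedy:
        E *= gamma * lam                           # decay traces while acting greedily
    else:
        E[:] = 0.0                                 # cut traces when the next action explores
    return delta
```

In this baseline the traces are reset whenever an exploratory action is taken, so multi-step credit is lost after each exploration step. The abstract's stated aim of keeping effective traces working continuously can be read as a response to that limitation.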

Key words: path planning, Q-learning, convergence speed, action selection strategy, grid map

CLC Number: TP391.9

Wang Yuelong, Wang Songyan, Chao Tao. Multi-step Information Aided Q-learning Path Planning Algorithm[J]. Journal of System Simulation, 2024, 36(9): 2137-2148.



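Parameter	Value
Learning rate α	0.4
Discount factor γ	0.95
Initial exploration factor ε₀	0.1
Target point reward r_p	2
Obstacle reward r_o	-0.2
Free-movement reward r_w	-0.1
Maximum number of iterations n_max	10000
Maximum path length L_max	1000
Maximum standard deviation for convergence s_max	1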
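
To show how the parameters in the table above fit together, here is a minimal sketch of plain tabular Q-learning with a fixed ε-greedy policy on a grid map. It corresponds to the traditional baseline only, not the improved algorithm. Several details are assumptions that the source does not specify: the 0/1 occupancy grid, the four-neighbor move set, treating map borders like obstacles, leaving the robot in place after a blocked move, and the 10-episode window used for the standard-deviation convergence check. The names `step`, `converged`, and `train` are illustrative.

```python
import numpy as np

# Values taken from the parameter table above.
ALPHA, GAMMA, EPS0 = 0.4, 0.95, 0.1
R_GOAL, R_OBST, R_STEP = 2.0, -0.2, -0.1
N_MAX, L_MAX, S_MAX = 10_000, 1_000, 1.0

MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right (assumed)


def step(grid, pos, action, goal):
    """Return (next_pos, reward, done) for one move on a 0/1 occupancy grid."""
    r, c = pos[0] + MOVES[action][0], pos[1] + MOVES[action][1]
    inside = 0 <= r < grid.shape[0] and 0 <= c < grid.shape[1]
    if not inside or grid[r, c] == 1:
        return pos, R_OBST, False  # assumption: blocked moves leave the robot in place
    if (r, c) == goal:
        return (r, c), R_GOAL, True
    return (r, c), R_STEP, False


def converged(lengths, window=10):
    """True when the last `window` path lengths are all below L_MAX and
    their standard deviation is at most S_MAX (window size is an assumption)."""
    recent = lengths[-window:]
    if len(recent) < window or max(recent) >= L_MAX:
        return False  # too few episodes, or an episode hit the length cap
    return float(np.std(recent)) <= S_MAX


def train(grid, start, goal, n_max=N_MAX, seed=0):
    """Run tabular Q-learning; return the Q-table and per-episode path lengths."""
    rng = np.random.default_rng(seed)
    Q = np.zeros(grid.shape + (len(MOVES),))
    lengths = []
    for _ in range(n_max):
        pos = start
        for n in range(1, L_MAX + 1):
            if rng.random() < EPS0:
                a = int(rng.integers(len(MOVES)))  # explore
            else:
                a = int(np.argmax(Q[pos]))         # exploit
            nxt, reward, done = step(grid, pos, a, goal)
            target = reward if done else reward + GAMMA * Q[nxt].max()
            Q[pos + (a,)] += ALPHA * (target - Q[pos + (a,)])
            pos = nxt
            if done:
                break
        lengths.append(n)
        if converged(lengths):
            break
    return Q, lengths


# Hypothetical 5x5 map with two obstacle cells, for demonstration only.
demo_grid = np.zeros((5, 5), dtype=int)
demo_grid[1, 1] = demo_grid[2, 3] = 1
demo_Q, demo_lengths = train(demo_grid, (0, 0), (4, 4), n_max=300)
```

The check against `L_MAX` in `converged` guards against a false positive: a run of episodes that all time out at the length cap has zero standard deviation but has not found a path.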

Metric	TRAD_Q-learning	FPA_Q-learning	IMP_Q-learning	MIMP_Q-learning
Optimal path length/m	28	28	28	28
Improvement/%	-	0	0	0
Average number of iterations	1075.5	800.7	549.2	111.8
Improvement/%	-	25.5509	48.9354	89.6048
Average convergence time/s	0.3641	0.2909	0.2762	0.1277
Improvement/%	-	14.4708	18.7969	62.4337

Metric	TRAD_Q-learning	FPA_Q-learning	IMP_Q-learning	MIMP_Q-learning
Optimal path length/m	58	58	58	58
Improvement/%	-	0	0	0
Average number of iterations	7443.0	6484.9	2704.2	562.8
Improvement/%	-	12.8725	63.6679	92.4385
Average convergence time/s	3.1891	2.8515	1.8828	0.9721
Improvement/%	-	10.5863	40.9607	69.5185
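
The improvement percentages in the table directly above appear to be relative reductions against the TRAD_Q-learning column. The source does not state the formula, so the helper below is an inferred reading rather than the authors' definition, and `improvement_pct` is a hypothetical name.

```python
def improvement_pct(baseline, value):
    """Relative reduction of `value` against `baseline`, in percent."""
    return (baseline - value) / baseline * 100.0


# Average iteration counts from the table directly above.
fpa_gain = improvement_pct(7443.0, 6484.9)   # FPA_Q-learning vs TRAD_Q-learning
mimp_gain = improvement_pct(7443.0, 562.8)   # MIMP_Q-learning vs TRAD_Q-learning
```

Under this reading, a negative value means the variant needed more iterations or more time than the traditional baseline.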

Metric	TRAD_Q-learning	FPA_Q-learning	IMP_Q-learning	MIMP_Q-learning
Optimal path length/m	78	78	78	78
Improvement/%	-	0	0	0
Average number of iterations	10000.0	9252.2	4832.3	1028.7
Improvement/%	-	7.478	51.677	89.713
Average convergence time/s	6.2668	5.6128	4.3615	2.5700
Improvement/%	-	10.4359	30.4027	58.9898

Metric	TRAD_Q-learning	FPA_Q-learning	IMP_Q-learning	MIMP_Q-learning
Optimal path length/m	58	58	58	58
Improvement/%	-	0	0	0
Average number of iterations	7642.4	7663.3	1732.0	367.0
Improvement/%	-	-0.2735	77.3370	95.1978
Average convergence time/s	3.2603	3.1753	1.1159	0.7176
Improvement/%	-	2.6075	55.6520	77.9884