Evolutionary Reinforcement Learning Based on Elite Instruction and Random Search

doi:10.16182/j.issn1004731x.joss.24-0604

Abstract

Evolutionary reinforcement learning suffers from low sample efficiency, a single coupling mode between its components, and poor convergence, all of which limit its performance and scalability. To address these problems, an improved algorithm based on elite gradient guidance and double random search is proposed. During training of the reinforcement policy, gradient guidance from an elite policy that carries evolutionary information is introduced to correct the direction of the policy gradient update. Double random search replaces the original evolutionary component, which reduces algorithmic complexity and keeps the policy search in parameter space meaningful and controllable. A complete-replacement information exchange balances learning and exploration between the reinforcement policy and the evolutionary policies. Experimental results show that, compared with classical evolutionary reinforcement learning methods, the proposed method improves exploration ability, robustness, and convergence.

Key words: evolutionary reinforcement learning, deep reinforcement learning, evolutionary algorithm, continuous control, elite gradient instruction
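To make the idea of a controllable search in parameter space concrete, the following is a minimal sketch of basic random search with mirrored (antithetic) perturbations. It is not the authors' double random search, whose details the abstract does not give. The toy objective `evaluate`, the function `random_search`, and the parameters `nu`, `alpha`, and `n_dirs` are all assumptions made for illustration; in a real setting `evaluate` would be the return of a policy rollout.

```python
import random


def evaluate(theta):
    # Hypothetical stand-in for an episode return: higher is better,
    # with the maximum (0.0) at theta = (1, -2).
    target = (1.0, -2.0)
    return -sum((t - g) ** 2 for t, g in zip(theta, target))


def random_search(theta, steps=200, n_dirs=4, nu=0.1, alpha=0.05, seed=0):
    """Basic random search: probe theta +/- nu * delta and step along
    the directions whose positive probe scored better."""
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(steps):
        update = [0.0] * len(theta)
        for _ in range(n_dirs):
            # Sample one Gaussian direction in parameter space.
            delta = [rng.gauss(0.0, 1.0) for _ in theta]
            plus = evaluate([t + nu * d for t, d in zip(theta, delta)])
            minus = evaluate([t - nu * d for t, d in zip(theta, delta)])
            # Weight the direction by the difference of the mirrored probes.
            for i, d in enumerate(delta):
                update[i] += (plus - minus) * d
        # Average over directions and take a step of size alpha.
        theta = [t + alpha / n_dirs * u for t, u in zip(theta, update)]
    return theta


theta = random_search([0.0, 0.0])
print("final parameters:", theta)
print("final score:", evaluate(theta))
```

The perturbation scale `nu` and step size `alpha` are the two knobs that keep the search controllable in this sketch: `nu` sets how far each probe moves from the current parameters, and `alpha` sets how strongly the probes change them.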

CLC number: TP399

Di Jian, Wan Xue, Jiang Limei. Evolutionary Reinforcement Learning Based on Elite Instruction and Random Search[J]. Journal of System Simulation, 2025, 37(11): 2877-2887.

Task	EDC-RL Mean	EDC-RL Median	EDC-RL Std	CEM-TD3 Mean	CEM-TD3 Median	CEM-TD3 Std	ERL Mean	ERL Median	ERL Std	TD3 Mean	TD3 Median	TD3 Std
Swimmer-v2	365.85	365.09	1.8	151.97	138.10	105.36	152.14	149.54	67.77	83.22	77.15	28.32
HalfCheetah-v2	12122.18	12130.03	237.75	10875.90	10789.85	706.70	5753.22	5201.25	1031.90	10500.78	10452.88	419.51
Walker2d-v2	4679.29	4402.94	400.48	4126.37	4151.47	477.87	3984.24	4278.72	635.45	3488.45	3571.11	461.32
Hopper-v2	3823.54	3771.83	150.33	3133.02	3714.95	1277.34	3129.92	3057.53	267.24	3165.54	3555.34	742.16
Ant-v2	3532.07	3294.87	861.32	3339.61	4178.80	1749.38	2060.80	2320.38	797.08	4729.97	4790.98	211.03
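
The mean returns in the table above can be compared directly. The short sketch below copies only the Mean columns from the table (thousands separators removed) and reports, for each task, the method with the highest mean and the relative gap between EDC-RL and the best of the other three methods. The variable and function names are illustrative choices, not part of the original work.

```python
# Mean returns copied from the Mean columns of the results table.
mean_return = {
    "Swimmer-v2": {"EDC-RL": 365.85, "CEM-TD3": 151.97, "ERL": 152.14, "TD3": 83.22},
    "HalfCheetah-v2": {"EDC-RL": 12122.18, "CEM-TD3": 10875.90, "ERL": 5753.22, "TD3": 10500.78},
    "Walker2d-v2": {"EDC-RL": 4679.29, "CEM-TD3": 4126.37, "ERL": 3984.24, "TD3": 3488.45},
    "Hopper-v2": {"EDC-RL": 3823.54, "CEM-TD3": 3133.02, "ERL": 3129.92, "TD3": 3165.54},
    "Ant-v2": {"EDC-RL": 3532.07, "CEM-TD3": 3339.61, "ERL": 2060.80, "TD3": 4729.97},
}


def best_method(scores):
    # Return the method name with the highest mean return.
    return max(scores, key=scores.get)


best_by_task = {task: best_method(scores) for task, scores in mean_return.items()}
edc_wins = sum(1 for method in best_by_task.values() if method == "EDC-RL")

for task, scores in mean_return.items():
    # Best mean among the three comparison methods.
    best_baseline = max(v for k, v in scores.items() if k != "EDC-RL")
    gap = (scores["EDC-RL"] - best_baseline) / best_baseline * 100.0
    print(f"{task}: best={best_by_task[task]}, EDC-RL vs best baseline: {gap:+.1f}%")

print(f"EDC-RL has the highest mean on {edc_wins} of {len(mean_return)} tasks")
```

On these numbers EDC-RL has the highest mean on every task except Ant-v2, where TD3 is higher. The comparison uses means only and ignores the Median and Std columns, so it says nothing about statistical significance.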