Journal of System Simulation, 2025, Vol. 37, Issue (7): 1753-1769. doi: 10.16182/j.issn1004731x.joss.25-0533

• Invited Review •

深度强化学习中策略表征研究简述

陈真2,3, 吴卓屹2,3, 张霖1,2,3   

  1. 1.北京航空航天大学 杭州国际创新研究院,浙江 杭州 311115
    2.北京航空航天大学 自动化科学与电气工程学院,北京 100191
    3.复杂产品智能制造系统技术全国重点实验室,北京 100854
  • 收稿日期:2025-06-09 修回日期:2025-06-23 出版日期:2025-07-18 发布日期:2025-07-30
  • 通讯作者: 张霖
  • 第一作者简介:陈真(1998-),男,博士,研究方向为基于深度强化学习的组合优化。
  • 基金资助:
    国家自然科学基金(62373026)

Research on Policy Representation in Deep Reinforcement Learning

Chen Zhen2,3, Wu Zhuoyi2,3, Zhang Lin1,2,3   

  1. Hangzhou International Innovation Institute, Beihang University, Hangzhou, Zhejiang 311115, China
  2. School of Automation Science and Electrical Engineering, Beihang University, Beijing 100191, China
  3. State Key Laboratory of Intelligent Manufacturing Systems Technology, Beijing 100854, China
  • Received: 2025-06-09  Revised: 2025-06-23  Online: 2025-07-18  Published: 2025-07-30
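  • Corresponding author: Zhang Lin
  • About the first author: Chen Zhen (1998-), male, Ph.D.; research interest: combinatorial optimization based on deep reinforcement learning.
  • Funding: National Natural Science Foundation of China (62373026)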

摘要:

深度强化学习(deep reinforcement learning,DRL)在多个领域取得了显著成功,但DRL的策略网络在泛化性、多任务适应性和样本效率等方面仍面临巨大挑战。策略表征作为提升DRL能力的重要研究方向,通过构建更高效、更泛化的策略表达形式,提升智能体对环境变化及新任务的适应能力。概述了策略表征领域的关键研究进展,介绍了从传统的基于多层感知机(multi-layer perceptron,MLP)策略到基于指针网络、序列生成模型、扩散模型、超网络、模块化设计以及专家混合模型以及基于序列化Token的跨模态策略等多样化策略架构,还从策略输入和中间表达的语义如何编码和优化等策略表征方法层面归纳分析前沿研究。总结并对未来可能的发展趋势进行了展望。

关键词: 策略表征, 深度强化学习, 泛化能力, 多任务学习

Abstract:

Deep reinforcement learning (DRL) has achieved remarkable success in many domains, yet DRL policy networks still face significant challenges in generalizability, multi-task adaptability, and sample efficiency. Policy representation, an important research direction for strengthening DRL, aims to improve an agent's adaptability to environmental changes and novel tasks by constructing more efficient and more generalizable forms of policy expression. This paper gives a concise overview of key research advances in policy representation. It introduces diverse policy architectures, ranging from traditional multi-layer perceptron (MLP)-based policies to those based on pointer networks, sequence generation models, diffusion models, hypernetworks, modular designs, mixture-of-experts models, and cross-modal policies built on serialized tokens. At the level of representation methods, it also reviews and analyzes frontier research on how the semantics of policy inputs and intermediate representations are encoded and optimized. It closes with a summary and an outlook on likely future trends.

Key words: policy representation, deep reinforcement learning, generalizability, multi-task learning

CLC Number: