Research on Policy Representation in Deep Reinforcement Learning

doi:10.16182/j.issn1004731x.joss.25-0533

Abstract

Deep reinforcement learning (DRL) has achieved remarkable success in many domains, but DRL policy networks still face significant challenges in generalization, multi-task adaptability, and sample efficiency. Policy representation, an important research direction for strengthening DRL, aims to improve an agent's adaptability to environmental changes and new tasks by constructing more efficient and more generalizable forms of policy expression. This paper gives a concise overview of key research advances in policy representation. It introduces diverse policy architectures, ranging from traditional multi-layer perceptron (MLP)-based policies to those based on pointer networks, sequence generation models, diffusion models, hypernetworks, modular designs, and mixture-of-experts models, as well as cross-modal policies based on serialized tokens. It also surveys frontier research at the level of policy representation methods, namely how the semantics of policy inputs and intermediate representations are encoded and optimized. The paper closes with a summary and an outlook on possible future trends.

Key words: policy representation, deep reinforcement learning, generalizability, multi-task learning

Chinese Library Classification (CLC) number: TP391.9

Chen Zhen, Wu Zhuoyi, Zhang Lin. Research on Policy Representation in Deep Reinforcement Learning[J]. Journal of System Simulation, 2025, 37(7): 1753-1769.

Figures/Tables: 1

References: 110


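Policy architecture type	Input structure	Action output form	Typical application scenarios
MLP-based policy architecture	State vector (optionally concatenated with a task label)	Softmax probabilities / Gaussian distribution	Simple tasks; continuous/discrete control problems
Pointer-network-based policy architecture	State features + sequence of dynamic elements (e.g., cities, task points)	Attention index pointing to an input element	Sorting, path planning, combinatorial optimization (TSP/CVRP)
Sequence-modeling-based policy architecture	Trajectory sequence of returns + states + actions (optionally with a prompt)	Next action generated autoregressively	Offline RL, conditional policy modeling, multi-task control
Diffusion-based policy architecture	State features + random noise vector + (optional) goal/reward guidance	High-dimensional action generated by reverse-diffusion sampling	Multimodal policy generation, offline RL, complex control
Hypernetwork-driven policy architecture	State embedding + task/context vector	Policy network parameters generated by the hypernetwork	Multi-task transfer, cooperative control, zero-shot generalization
Modular policy architecture	Local state + graph connectivity information (nodes/edges)	Joint action composed of the local policy outputs of each module	Multi-agent systems, robots with variable morphology
Mixture-of-experts-based policy architecture	State vector + context vector (task or environment cue)	Gated fusion of the actions of multiple experts	Multimodal tasks, non-stationary policy ensembles, long-horizon control
Serialized-token-based policy architecture	Multimodal information unified into a token sequence	Actions generated autoregressively by a sequence model	Cross-modal tasks, multi-task generalization, unified policy learning and deployment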
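
The input/output contracts in four rows of the table (MLP, pointer network, mixture of experts, hypernetwork) can be illustrated with a minimal pure-Python sketch. It is not the implementation of any surveyed method: every function name, argument, and weight layout below is a hypothetical choice made for illustration, each network is reduced to a single linear layer, and no training is shown.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, x) for row in W]


def mlp_policy(state, W, b):
    """MLP row: state vector in, softmax action probabilities out.
    Assumption: one linear layer stands in for the full MLP."""
    logits = [z + bi for z, bi in zip(matvec(W, state), b)]
    return softmax(logits)


def pointer_policy(query, elements, visited):
    """Pointer-network row: score every input element against a query
    and return the index of the best unvisited one plus the attention
    distribution. Assumes at least one element is still unvisited."""
    scores = [
        -math.inf if i in visited else dot(query, e)
        for i, e in enumerate(elements)
    ]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


def moe_policy(state, context, experts, gate_W):
    """Mixture-of-experts row: a gate conditioned on state and context
    weights the actions proposed by each (linear) expert."""
    gate = softmax(matvec(gate_W, state + context))
    actions = [matvec(W_k, state) for W_k in experts]
    dim = len(actions[0])
    return [sum(g * a[j] for g, a in zip(gate, actions)) for j in range(dim)]


def hypernet_policy(state, task, hyper_W, n_actions):
    """Hypernetwork row: the task vector is mapped to the flat weights of
    a linear policy, which is then applied to the state."""
    flat = matvec(hyper_W, task)  # length must be n_actions * len(state)
    d = len(state)
    W = [flat[i * d:(i + 1) * d] for i in range(n_actions)]
    return softmax(matvec(W, state))


if __name__ == "__main__":
    state = [1.0, 0.0]
    print(mlp_policy(state, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))
    print(pointer_policy(state, [[5.0, 0.0], [1.0, 0.0], [0.0, 1.0]], {0}))
```

The sketch highlights the main structural difference between the rows: the MLP and mixture-of-experts policies output an action or action distribution directly, the pointer policy outputs an index into its own input, and the hypernetwork outputs the parameters of another policy rather than an action.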