Research on PAC-Bayes-Based A2C Algorithm for Multi-objective Reinforcement Learning

doi:10.16182/j.issn1004731x.joss.25-FZ0691

Abstract

To address the theoretical challenges of balancing exploration and exploitation and of modeling uncertainty in multi-objective reinforcement learning (MORL), this study develops MO-PAC, a learning framework based on PAC-Bayes theory. By introducing a multi-objective stochastic Critic network and a dynamic preference mechanism, the framework extends the conventional advantage actor-critic (A2C) architecture, enabling adaptive and efficient approximation of complex Pareto fronts. Experimental results in multi-objective MuJoCo environments show that MO-PAC outperforms the baseline algorithms, improving hypervolume by approximately 20% and expected utility by 60%, while exhibiting superior convergence efficiency and robustness. These results verify both the theoretical value and the practical performance advantages of the framework in multi-objective decision-making. MO-PAC overcomes the theoretical limitations of existing methods in dynamic trade-off and uncertainty modeling, providing a novel methodological foundation for advancing the MORL theoretical framework.
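The abstract reports gains in hypervolume and expected utility without restating how these metrics are computed. As an illustration only, the sketch below implements their commonly used definitions for a two-objective maximization problem: hypervolume is the area dominated by the Pareto front relative to a reference point, and expected utility is the average, over a set of linear preference weights, of the best weighted return on the front. The function names, the reference point, and the equally spaced weight set are assumptions made for this sketch; the paper's actual evaluation protocol (reference points, weight sampling, and the three-objective MO-Hopper case) is not specified here.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def hypervolume_2d(points: Sequence[Point], ref: Point) -> float:
    """Area dominated by `points` and bounded below by `ref` (both objectives maximized)."""
    # Keep only points that strictly improve on the reference point in both objectives.
    pts = [p for p in points if p[0] > ref[0] and p[1] > ref[1]]
    # Sweep from the largest first objective to the smallest.
    pts.sort(key=lambda p: p[0], reverse=True)
    area = 0.0
    best_y = ref[1]
    for x, y in pts:
        # A point adds area only if it raises the best second objective seen so far;
        # otherwise it is dominated by a point already swept.
        if y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


def simplex_weights_2d(n: int) -> List[Point]:
    """n equally spaced preference vectors (w1, w2) with w1 + w2 = 1 (hypothetical helper, n >= 2)."""
    return [(i / (n - 1), 1 - i / (n - 1)) for i in range(n)]


def expected_utility(front: Sequence[Point], weights: Sequence[Point]) -> float:
    """Mean over preference vectors of the best linear utility achievable on the front."""
    total = 0.0
    for w in weights:
        total += max(w[0] * v[0] + w[1] * v[1] for v in front)
    return total / len(weights)


if __name__ == "__main__":
    front = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0)]
    print(hypervolume_2d(front, (0.0, 0.0)))
    print(expected_utility(front, simplex_weights_2d(3)))
```

For the toy front above, the dominated region is the union of three boxes with total area 6, and the expected utility over the weights (0, 1), (0.5, 0.5), (1, 0) is (3 + 2 + 3) / 3. Relative improvements such as those quoted in the abstract would compare these values between algorithms under the same reference point and weight set.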

Key words: multi-objective optimization, multi-objective reinforcement learning, actor-critic algorithm, uncertainty, PAC-Bayes theory

CLC Number: TP391.9

Liu Xiang, Jin Qiankun. Research on PAC-Bayes-Based A2C Algorithm for Multi-objective Reinforcement Learning[J]. Journal of System Simulation, 2025, 37(12): 3212-3223.

Figures/Tables: 7 (Figs. 1-6 and Table 1)

Environment	State-space dimension	Action-space dimension	Reward-space dimension
MO-HalfCheetah	11	3	2
MO-Hopper-2d	17	6	2
MO-Hopper	17	6	3