Simulation of Robotic Arm Ball-catching Strategy Based on Curriculum RL of Transformer

doi:10.16182/j.issn1004731x.joss.25-0768

Abstract:

To address the difficult convergence and low training efficiency of traditional RL methods on complex, dynamic, high-degree-of-freedom tasks such as robotic arm ball catching, a method is proposed that integrates the PPO algorithm with a Transformer network architecture and introduces a curriculum learning strategy. The Transformer is used to capture the complex high-dimensional dependencies among the robotic arm's state space, the ball's trajectory, and the environment's physical parameters. Curriculum learning orders the training task objectives from simple to difficult, progressively increasing the catching difficulty. Experimental results show that, under the same conditions, the method raises the ball-catching success rate by more than 60% over traditional PPO and achieves excellent capture accuracy on ball trajectories with real-world disturbance characteristics. The method improves the performance and efficiency of dynamic catching by robotic arms under both simulated and real-world disturbance conditions, and offers a new approach to complex task control in real-world scenarios.

Key words: RL, curriculum learning, Transformer, robotic arm, ball-catching control
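
The abstract describes the curriculum only at a high level: training tasks ordered from simple to difficult. As a minimal sketch of that idea, and not the authors' implementation, the Python below moves through hypothetical difficulty stages once a rolling success rate reaches a threshold. The stage names, the 0.8 threshold, and the window length are all illustrative assumptions.

```python
from collections import deque


class CurriculumScheduler:
    """Advance through ordered difficulty stages based on recent success rate."""

    def __init__(self, stages, threshold=0.8, window=20):
        self.stages = list(stages)    # ordered easy -> hard (hypothetical names)
        self.threshold = threshold    # assumed promotion threshold
        self.window = window          # assumed number of recent episodes to average
        self.index = 0
        self.results = deque(maxlen=window)

    @property
    def stage(self):
        return self.stages[self.index]

    def record(self, success):
        """Log one episode outcome and promote when the full window is good enough."""
        self.results.append(bool(success))
        window_full = len(self.results) == self.window
        rate = sum(self.results) / len(self.results)
        is_last = self.index == len(self.stages) - 1
        if window_full and rate >= self.threshold and not is_last:
            self.index += 1
            self.results.clear()      # start the new stage with a fresh window
        return self.stage


scheduler = CurriculumScheduler(["slow_ball", "fast_ball", "perturbed_ball"], window=5)
for _ in range(5):
    scheduler.record(True)
print(scheduler.stage)  # fast_ball
```

Clearing the window on promotion keeps successes on an easier stage from counting toward the next one.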

CLC number: TP391.9

Zhang Ziyao, Ji Yunfeng. Simulation of Robotic Arm Ball-catching Strategy Based on Curriculum RL of Transformer[J]. Journal of System Simulation, 2026, 38(2): 321-331.

Figures/Tables (12): Figures 1-9, Tables 1-3


Joint	Rotation axis	Rotation limits, upper/lower (°)	Speed limits, upper/lower ((°)/s)
J1	y	90, -60	120, 0
J2	x	50, -15	80, 0
J3	z	180, 0	80, 0
J4	x	105, -90	80, 0
J5	z	179, -179	80, 0
J6	x	100, -100	80, 0
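
To make the limits in the table concrete, the sketch below stores them in a Python dictionary and clamps a commanded joint angle and speed to the listed ranges. Only the numeric limits and axes come from the table. The data layout, the function name `clamp_command`, and the reading of speed as a non-negative magnitude are assumptions made for illustration.

```python
# Joint limits from the table: axis, (lower, upper) angle in degrees, max speed in deg/s.
JOINT_LIMITS = {
    "J1": ("y", (-60.0, 90.0), 120.0),
    "J2": ("x", (-15.0, 50.0), 80.0),
    "J3": ("z", (0.0, 180.0), 80.0),
    "J4": ("x", (-90.0, 105.0), 80.0),
    "J5": ("z", (-179.0, 179.0), 80.0),
    "J6": ("x", (-100.0, 100.0), 80.0),
}


def clamp_command(joint, angle_deg, speed_deg_s):
    """Clamp a commanded angle and speed to the joint's listed limits."""
    _axis, (lower, upper), max_speed = JOINT_LIMITS[joint]
    angle = min(max(angle_deg, lower), upper)
    # Assumption: speed is a magnitude, so its lower bound is 0.
    speed = min(max(speed_deg_s, 0.0), max_speed)
    return angle, speed


print(clamp_command("J1", 100.0, 150.0))  # (90.0, 120.0)
```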

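Parameter	Definition	Value
batch_size	Number of samples drawn from the buffer for each model update	256
buffer_size	Number of samples held in the buffer	4096
learning_rate	Learning rate	0.0001
beta	Policy entropy regularization coefficient	0.02
epsilon	Clipping range coefficient	0.2
hidden_units	Hidden units	256
max_steps	Number of training steps	30×10⁵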
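
The `epsilon` and `beta` entries play the roles of the clipping range and the entropy weight in the standard PPO clipped surrogate objective (Schulman et al., 2017). The sketch below evaluates that objective for a small batch in plain Python. It is a generic illustration under that assumption, not the authors' training code, and the function name `ppo_objective` and its inputs are hypothetical.

```python
EPSILON = 0.2   # clipping range coefficient (from the table)
BETA = 0.02     # policy entropy regularization coefficient (from the table)


def ppo_objective(ratios, advantages, entropies, epsilon=EPSILON, beta=BETA):
    """Mean clipped surrogate plus entropy bonus over a batch (to be maximized).

    ratios:     new-policy / old-policy action probabilities, one per sample
    advantages: advantage estimates, one per sample
    entropies:  policy entropy at each sample's state
    """
    total = 0.0
    for ratio, advantage, entropy in zip(ratios, advantages, entropies):
        clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        # Take the more pessimistic of the unclipped and clipped terms.
        surrogate = min(ratio * advantage, clipped_ratio * advantage)
        total += surrogate + beta * entropy
    return total / len(ratios)


# A ratio of 1.5 with positive advantage is capped at 1 + epsilon = 1.2.
value = ppo_objective([1.5], [1.0], [0.0])
```

With the listed sizes, one pass over a full buffer yields 4096 / 256 = 16 minibatches.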

Algorithm	Mean ± standard deviation	95% confidence interval
PPO-LSTM	30.2±4.5	[27.1, 33.3]
Curriculum-learning PPO-LSTM	60.5±5.2	[57.2, 63.8]
Curriculum-learning PPO-Transformer	90.3±3.8	[88.0, 92.6]
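
As a quick consistency check on the table, the sketch below computes each row's gain over the PPO-LSTM row and the half-width of each reported interval. The unit of the values is not stated in this excerpt, so the differences are in the table's own units. The dictionary layout and the function name `summarize` are illustrative.

```python
# (mean, standard deviation, (CI lower, CI upper)) copied from the table.
RESULTS = {
    "PPO-LSTM": (30.2, 4.5, (27.1, 33.3)),
    "Curriculum-learning PPO-LSTM": (60.5, 5.2, (57.2, 63.8)),
    "Curriculum-learning PPO-Transformer": (90.3, 3.8, (88.0, 92.6)),
}


def summarize(results, baseline="PPO-LSTM"):
    """Return gain over the baseline mean and CI half-width for each algorithm."""
    base_mean = results[baseline][0]
    rows = {}
    for name, (mean, _sd, (lower, upper)) in results.items():
        rows[name] = {
            "gain_over_baseline": round(mean - base_mean, 1),
            "ci_half_width": round((upper - lower) / 2, 1),
            "ci_centered_on_mean": abs((lower + upper) / 2 - mean) < 0.05,
        }
    return rows


summary = summarize(RESULTS)
print(summary["Curriculum-learning PPO-Transformer"]["gain_over_baseline"])  # 60.1
```

The 60.1-unit gap between the first and last rows agrees with the abstract's "more than 60%" figure if that figure is read as percentage points and the PPO-LSTM row is the baseline it refers to.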