Simulation of Robotic Arm Ball-catching Strategy Based on Curriculum RL of Transformer

doi:10.16182/j.issn1004731x.joss.25-0768

Abstract

A method integrating the proximal policy optimization (PPO) algorithm with a Transformer network architecture, combined with a curriculum learning strategy, is proposed to address the poor training convergence and low efficiency of traditional reinforcement learning (RL) methods in complex, dynamic, high-degree-of-freedom tasks such as robotic-arm ball catching. The Transformer captures the complex high-dimensional dependencies among the robotic arm's state space, the ball trajectory, and the environmental physical parameters. Curriculum learning progressively increases catching difficulty through a sequence of training tasks ordered from simple to complex objectives. Experimental results show that the method raises the ball-catching success rate by more than 60 percentage points compared with traditional PPO and tracks balls with real-world disturbance characteristics with high accuracy. The method not only improves the performance and efficiency of dynamic catching for robotic arms under both simulated and real-world disturbance conditions, but also offers a new approach to complex task control in real-world scenarios.
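The simple-to-complex curriculum described above can be illustrated with a small stage scheduler. This is a minimal sketch, not the authors' implementation: the stage names, ball-speed ranges, landing-region radii, the 0.8 promotion threshold and the rolling window size are all hypothetical. It shows one common scheme, advancing to a harder stage once the recent catch rate is high enough.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One curriculum stage; every field is a hypothetical difficulty knob."""
    name: str
    ball_speed_range: tuple  # (min, max) launch speed in m/s, assumed
    target_spread: float     # radius of the landing region in m, assumed


# Hypothetical stages ordered from simple to complex.
STAGES = [
    Stage("static", (0.0, 0.0), 0.05),
    Stage("slow", (1.0, 2.0), 0.15),
    Stage("fast", (2.0, 4.0), 0.30),
]


class CurriculumScheduler:
    """Advance to the next stage once the rolling success rate passes a threshold."""

    def __init__(self, stages, threshold=0.8, window=100):
        self.stages = stages
        self.threshold = threshold
        self.results = deque(maxlen=window)  # most recent episode outcomes
        self.index = 0

    @property
    def stage(self):
        return self.stages[self.index]

    def record(self, caught: bool) -> bool:
        """Log one episode outcome; return True if the curriculum advanced."""
        self.results.append(1.0 if caught else 0.0)
        window_full = len(self.results) == self.results.maxlen
        rate = sum(self.results) / len(self.results)
        if window_full and rate >= self.threshold and self.index < len(self.stages) - 1:
            self.index += 1
            self.results.clear()  # re-measure performance on the harder task
            return True
        return False
```

For example, with `window=10`, ten consecutive catches at the `static` stage move the scheduler to the `slow` stage. Clearing the history on promotion forces the success rate to be measured again on the harder task instead of being carried over from the easier one.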

Key words: RL, curriculum learning, Transformer, robotic arm, ball-catching control

CLC Number: TP391.9

Zhang Ziyao, Ji Yunfeng. Simulation of Robotic Arm Ball-catching Strategy Based on Curriculum RL of Transformer[J]. Journal of System Simulation, 2026, 38(2): 321-331.

Figures/Tables 12: Fig. 1, Table 1, Fig. 2, Fig. 3, Fig. 4, Fig. 5, Table 2, Fig. 6, Table 3, Fig. 7, Fig. 8, Fig. 9

Joint	Rotation axis	Rotation limits (upper, lower)/(°)	Speed limits (upper, lower)/((°)/s)
J1	y	90, -60	120, 0
J2	x	50, -15	80, 0
J3	z	180, 0	80, 0
J4	x	105, -90	80, 0
J5	z	179, -179	80, 0
J6	x	100, -100	80, 0
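The joint table can be applied directly as a command filter in a simulation loop. The sketch below assumes that the first angle value in each row is the upper limit, that the speed limit bounds the magnitude of the angular velocity in either direction, and that commands are issued every `dt` seconds; the function `limit_command` and the dictionary layout are hypothetical and not taken from the paper.

```python
# Limits from the joint table: joint -> (axis, upper_deg, lower_deg, max_speed_deg_per_s)
JOINT_LIMITS = {
    "J1": ("y", 90.0, -60.0, 120.0),
    "J2": ("x", 50.0, -15.0, 80.0),
    "J3": ("z", 180.0, 0.0, 80.0),
    "J4": ("x", 105.0, -90.0, 80.0),
    "J5": ("z", 179.0, -179.0, 80.0),
    "J6": ("x", 100.0, -100.0, 80.0),
}


def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def limit_command(current_deg, target_deg, dt):
    """Return the next joint angles moving toward target_deg within the table limits.

    current_deg, target_deg: dicts mapping joint name -> angle in degrees.
    dt: control period in seconds (assumed fixed).
    Assumes the speed limit applies to |angular velocity| in both directions.
    """
    next_deg = {}
    for joint, (_axis, upper, lower, max_speed) in JOINT_LIMITS.items():
        goal = clamp(target_deg[joint], lower, upper)  # keep the goal inside the joint range
        max_step = max_speed * dt                      # largest move allowed this control tick
        step = clamp(goal - current_deg[joint], -max_step, max_step)
        next_deg[joint] = current_deg[joint] + step
    return next_deg
```

Clamping the goal before rate-limiting means an out-of-range target (for instance J1 at 200°) is approached only up to the 90° limit, at no more than 120 °/s.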

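Parameter	Definition	Value
batch_size	Number of samples drawn from the buffer for each model update	256
buffer_size	Number of samples held in the buffer	4096
learning_rate	Learning rate	0.0001
beta	Policy entropy regularization coefficient	0.02
epsilon	Clipping range coefficient	0.2
hidden_units	Number of hidden units	256
max_steps	Number of training steps	30×10⁵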
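The `epsilon` and `beta` entries play the roles of the clipping range and the entropy weight in the PPO objective of Schulman et al. The sketch below computes the clipped surrogate policy loss with an entropy bonus for a batch of samples in plain Python; the value-function loss and advantage estimation are omitted, and the function name is hypothetical. The parameter names in the table resemble Unity ML-Agents trainer configuration keys, but the source does not state which framework was used.

```python
import math


def ppo_policy_loss(logp_new, logp_old, advantages, entropies, epsilon=0.2, beta=0.02):
    """Clipped PPO surrogate with an entropy bonus (value-function term omitted).

    logp_new / logp_old: log-probabilities of the taken actions under the
    current policy and the policy that collected the data.
    advantages: advantage estimates A_t; entropies: per-sample policy entropy.
    Returns a loss to minimise: -mean(min(r*A, clip(r)*A)) - beta * mean(H).
    """
    n = len(advantages)
    surrogate = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                        # r_t = pi_new / pi_old
        clipped = max(1.0 - epsilon, min(1.0 + epsilon, ratio))  # clip(r_t, 1-eps, 1+eps)
        surrogate += min(ratio * adv, clipped * adv)             # pessimistic bound
    mean_entropy = sum(entropies) / n
    return -(surrogate / n) - beta * mean_entropy
```

For a single sample with probability ratio 1.5 and advantage 1.0, the ratio is clipped to 1.2 and the loss is -1.2; with ratio 0.5 and advantage -1.0, the clipped term -0.8 is the minimum, giving a loss of 0.8. In both cases the clip removes the incentive to move the policy further than `epsilon` from the data-collecting policy.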

Algorithm	Mean ± SD	95% confidence interval
PPO-LSTM	30.2±4.5	[27.1, 33.3]
Curriculum learning PPO-LSTM	60.5±5.2	[57.2, 63.8]
Curriculum learning PPO-Transformer	90.3±3.8	[88.0, 92.6]
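The results table reports a mean, a standard deviation and a 95% confidence interval for each algorithm, but this section does not state how many runs each row summarizes or how the interval was computed. Assuming independent runs and a Student-t interval on the mean, such statistics could be computed as in the sketch below; the t values are standard two-sided 95% critical values, the function name is hypothetical, and the run data in the usage note are invented for illustration only.

```python
import math
import statistics

# Two-sided 95% Student-t critical values t_{0.975, df} for small df.
T_975 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
         6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}


def mean_sd_ci95(samples):
    """Return (mean, sample SD, (low, high)) using a t-interval on the mean."""
    n = len(samples)
    if n - 1 not in T_975:
        raise ValueError("add the t critical value for df = n - 1")
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)  # sample SD with n - 1 denominator
    half_width = T_975[n - 1] * sd / math.sqrt(n)
    return mean, sd, (mean - half_width, mean + half_width)
```

For five hypothetical runs `[88.0, 92.0, 90.0, 91.0, 89.0]`, the function returns a mean of 90.0, a sample SD of about 1.58 and an interval of about [88.04, 91.96]. Note that the interval width depends on the number of runs as well as on the SD, so rows with similar SDs can have different interval widths.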