Journal of System Simulation ›› 2026, Vol. 38 ›› Issue (1): 211-224. doi: 10.16182/j.issn1004731x.joss.25-0833
• Papers •
Zhang Shuozhe, Song Wenfeng, Hou Xia, Li Shuai
Received: 2025-09-02
Revised: 2025-12-12
Online: 2026-01-18
Published: 2026-01-28
Contact: Song Wenfeng
Zhang Shuozhe, Song Wenfeng, Hou Xia, Li Shuai. Full-body Co-speech Gesture Generation Based on Spatial-temporal Enhanced Generation Model[J]. Journal of System Simulation, 2026, 38(1): 211-224.
Table 1. A comparison of the metrics of different algorithms on the BEAT test set
| Method | FGD↓ | BC→ | Div.↑ | MSE↓ | AITS↓ |
|---|---|---|---|---|---|
| CaMN | 6.64 | 6.763 | 10.861 | — | 0.132 |
| DiffuseStyleGesture | 8.81 | 7.335 | 11.492 | — | — |
| EMAGE | 5.43 | 7.853 | 13.121 | 0.122 | |
| SynTalker | 4.65 | 7.272 | 12.662 | — | 2.262 |
| MambaTalk | 5.36 | 7.932 | 13.030 | 6.845 | 0.065 |
| GestureLSM | 7.201 | 10.017 | 0.039 | ||
| Proposed method | 3.94 | 13.782 | 8.092 |

References
[1] Liu Haiyang, Zhu Zihao, Iwamoto Naoya, et al. BEAT: A Large-scale Semantic and Emotional Multi-modal Dataset for Conversational Gestures Synthesis[C]//Computer Vision – ECCV 2022. Cham: Springer Nature Switzerland, 2022: 612-630.
[2] Liu Haiyang, Zhu Zihao, Becherini Giorgio, et al. EMAGE: Towards Unified Holistic Co-speech Gesture Generation via Expressive Masked Audio Gesture Modeling[C]//2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2024: 1144-1154.
[3] Qi Xingqun, Pan Jiahao, Li Peng, et al. Weakly-supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation[C]//2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2024: 10424-10434.
[4] Xu Zunnan, Lin Yukang, Han Haonan, et al. MambaTalk: Efficient Holistic Gesture Synthesis with Selective State Space Models[C]//Proceedings of the 38th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2024: 20055-20080.
[5] Alexanderson Simon, Henter Gustav Eje, Kucherenko Taras, et al. Style-controllable Speech-driven Gesture Synthesis Using Normalising Flows[J]. Computer Graphics Forum, 2020, 39(2): 487-496.
[6] Chen Bohong, Li Yumeng, Ding Yaoxiang, et al. Enabling Synergistic Full-body Control in Prompt-based Co-speech Motion Generation[C]//Proceedings of the 32nd ACM International Conference on Multimedia. New York: ACM, 2024: 6774-6783.
[7] Yang Sicheng, Wu Zhiyong, Li Minglei, et al. DiffuseStyleGesture: Stylized Audio-driven Co-speech Gesture Generation with Diffusion Models[C]//Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence. California: IJCAI, 2023: 5860-5868.
[8] Gu A, Dao T. Mamba: Linear-time Sequence Modeling with Selective State Spaces[EB/OL]. (2024-05-31) [2025-04-05].
[9] Dao T, Gu A. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality[C]//Proceedings of the 41st International Conference on Machine Learning. Chia Laguna Resort: PMLR, 2024: 10041-10071.
[10] Liu Pinxin, Song Luchuan, Huang Junhua, et al. GestureLSM: Latent Shortcut Based Co-speech Gesture Generation with Spatial-temporal Modeling[EB/OL]. (2025-01-31) [2025-04-05].
[11] Zhang Mingyuan, Li Huirong, Cai Zhongang, et al. FineMoGen: Fine-grained Spatio-temporal Motion Generation and Editing[C]//Proceedings of the 37th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2023: 13981-13992.
[12] Liu Xian, Wu Qianyi, Zhou Hang, et al. Learning Hierarchical Cross-modal Association for Co-speech Gesture Generation[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 10452-10462.
[13] Ao Tenglong, Gao Qingzhe, Lou Yuke, et al. Rhythmic Gesticulator: Rhythm-aware Co-speech Gesture Synthesis with Hierarchical Neural Embeddings[J]. ACM Transactions on Graphics, 2022, 41(6): 209.
[14] Yi Hongwei, Liang Hualin, Liu Yifei, et al. Generating Holistic 3D Human Motion from Speech[C]//2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2023: 469-480.
[15] Mughal M Hamza, Dabral Rishabh, Scholman Merel C J, et al. Retrieving Semantics from the Deep: An RAG Solution for Gesture Synthesis[C]//2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2025: 16578-16588.
[16] Chen Changan, Zhang Juze, Lakshmikanth S K, et al. The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion[EB/OL]. (2024-12-13) [2025-04-05].
[17] Frans K, Hafner D, Levine S, et al. One Step Diffusion Via Shortcut Models[EB/OL]. (2024-10-16) [2025-08-12].
[18] Lenz B, Lieber O, Arazi A, et al. Jamba: Hybrid Transformer-Mamba Language Models[C]//ICLR 2025 Conference. New York: ICLR, 2025: 1-26.
[19] Wang Junxiong, Paliotta Daniele, May A, et al. The Mamba in the Llama: Distilling and Accelerating Hybrid Models[C]//Proceedings of the 38th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2024: 62432-62457.
[20] Touvron H, Lavril T, Izacard G, et al. LLaMA: Open and Efficient Foundation Language Models[EB/OL]. (2023-02-27) [2025-04-05].
[21] Zhu Lianghui, Liao Bencheng, Zhang Qian, et al. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model[C]//Proceedings of the 41st International Conference on Machine Learning. Chia Laguna Resort: PMLR, 2024: 62429-62442.
[22] Liu Yue, Tian Yunjie, Zhao Yuzhong, et al. VMamba: Visual State Space Model[C]//Proceedings of the 38th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2024: 103031-103063.
[23] Hu Vincent Tao, Baumann Stefan Andreas, Gui Ming, et al. ZigMa: A DiT-style Zigzag Mamba Diffusion Model[C]//Computer Vision – ECCV 2024. Cham: Springer Nature Switzerland, 2025: 148-166.
[24] Peebles W, Xie Saining. Scalable Diffusion Models with Transformers[C]//2023 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2023: 4172-4182.
[25] Shen Qiuhong, Wu Zike, Yi Xuanyu, et al. Gamba: Marry Gaussian Splatting with Mamba for Single View 3D Reconstruction[EB/OL]. (2024-05-24) [2025-04-05].
[26] Kerbl Bernhard, Kopanas Georgios, Leimkuehler Thomas, et al. 3D Gaussian Splatting for Real-time Radiance Field Rendering[J]. ACM Transactions on Graphics, 2023, 42(4): 139.
[27] Zhang Zeyu, Liu Akide, Reid Ian, et al. Motion Mamba: Efficient and Long Sequence Motion Generation[C]//Computer Vision – ECCV 2024. Cham: Springer Nature Switzerland, 2025: 265-282.
[28] Fu Chencan, Wang Yabiao, Zhang Jiangning, et al. MambaGesture: Enhancing Co-speech Gesture Generation with Mamba and Disentangled Multi-modality Fusion[C]//Proceedings of the 32nd ACM International Conference on Multimedia. New York: ACM, 2024: 10794-10803.
[29] Rombach Robin, Blattmann Andreas, Lorenz Dominik, et al. High-resolution Image Synthesis with Latent Diffusion Models[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 10674-10685.
[30] Lee S, Hoover B, Strobelt H, et al. Diffusion Explainer: Visual Explanation for Text-to-image Stable Diffusion[C]//2024 IEEE Visualization and Visual Analytics (VIS). Piscataway: IEEE, 2024: 96-100.
[31] Lin Xinyi, Wu Hongjia, Yuan Zhiting, et al. Computer Aided Analysis of Ancient Painting Seals Based on Image Extraction and Restoration[J]. Journal of Computer-Aided Design & Computer Graphics, 2025, 37(2): 254-264. (in Chinese)
[32] Kim Taehoon, Kang ChanHee, Park JaeHyuk, et al. Human Motion Aware Text-to-video Generation with Explicit Camera Control[C]//2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). Piscataway: IEEE, 2024: 5069-5078.
[33] Tevet G, Raab S, Gordon B, et al. Human Motion Diffusion Model[C]//ICLR 2023 Conference. New York: ICLR, 2023: 1-16.
[34] Chen Xin, Jiang Biao, Liu Wen, et al. Executing Your Commands via Motion Diffusion in Latent Space[C]//2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2023: 18000-18010.
[35] Xie Yiming, Jampani V, Zhong Lei, et al. OmniControl: Control Any Joint at Any Time for Human Motion Generation[C]//ICLR 2024 Conference. New York: ICLR, 2024: 1-19.
[36] Zhang Lümin, Rao Anyi, Agrawala M. Adding Conditional Control to Text-to-image Diffusion Models[C]//2023 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2023: 3813-3824.
[37] Zhou Yanqi, Lei Tao, Liu Hanxiao, et al. Mixture-of-experts with Expert Choice Routing[EB/OL]. (2022-10-14) [2025-04-05].
[38] Tseng J, Castellon R, Liu C K. EDGE: Editable Dance Generation from Music[C]//2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2023: 448-458.
[39] Shi Min, Sun Bilian, Zhu Dengming, et al. KM 2D: Method for Generating Dance Animation Driven by Dance Movement Primitives and Musical Semantics[J/OL]. Journal of Computer-Aided Design & Computer Graphics. (2025-03-15) [2025-08-12]. (in Chinese)
[40] Li Chenguang, Wen Yuhui, Jing Yuchen, et al. Shape-aware Stylized Dance Motion Generation Driven by Music[J/OL]. Journal of Computer-Aided Design & Computer Graphics. (2025-02-07) [2025-08-12]. (in Chinese)
[41] Ao Tenglong, Zhang Zeyi, Liu Libin. GestureDiffuCLIP: Gesture Diffusion Model with CLIP Latents[J]. ACM Transactions on Graphics, 2023, 42(4): 42.
[42] Radford A, Kim J W, Hallacy C, et al. Learning Transferable Visual Models from Natural Language Supervision[C]//Proceedings of the 38th International Conference on Machine Learning. Chia Laguna Resort: PMLR, 2021: 8748-8763.
[43] Chen Junming, Liu Yunfei, Wang Jianan, et al. DiffSHEG: A Diffusion-based Approach for Real-time Speech-driven Holistic 3D Expression and Gesture Generation[C]//2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2024: 7352-7361.
[44] Wang Su, Liu Yuelin, Sun Li. An Intelligent Generative Design Method for Product Styling Driven by Visual Perception Data[J/OL]. Journal of Computer-Aided Design & Computer Graphics. (2025-02-17) [2025-08-12]. (in Chinese)