Full-body Co-speech Gesture Generation Based on Spatial-temporal Enhanced Generation Model

doi:10.16182/j.issn1004731x.joss.25-0833

Abstract

Co-speech gestures generated in sync with audio can significantly enhance the interactivity of virtual digital humans. This requires gestures that are not only precisely synchronized with speech but also exhibit realistic full-body dynamics. Existing Transformer-based methods often overlook the temporal features of motion sequences, while diffusion-based methods do not fully account for the spatial correlations between body parts. To address these limitations, we propose a full-body motion generation method that combines a diffusion model, Mamba, and attention mechanisms. A spatial self-attention and temporal state space model layer (STMamba Layer) serves as the core module of the denoising network; it extracts spatial features across different body parts and temporal features within each part, improving motion quality and diversity. Full-body motion sequences are modeled along two dimensions. In the spatial dimension, rotary relative position encoding and self-attention capture the spatial correlations among the joints of different body parts. In the temporal dimension, Mamba captures the temporal dynamics of each body part to improve motion continuity. Experiments on BEAT2, a large-scale audio-text-motion dataset, show that the proposed method improves both fidelity and diversity over state-of-the-art approaches while maintaining a high inference speed.
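The split into a spatial pass and a temporal pass can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`rope`, `spatial_attention`, `temporal_scan`, `st_layer`) are hypothetical, the attention uses identity query/key/value projections, and the temporal pass is a simplified diagonal, input-dependent linear recurrence standing in for Mamba's selective state space scan. Audio conditioning, the diffusion timestep, and all learned weights are omitted.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotary encoding: rotate consecutive feature pairs by a position-dependent angle."""
    out = list(vec)
    d = len(vec)
    for i in range(0, d - 1, 2):
        angle = pos * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def spatial_attention(frame):
    """Self-attention over the body-part tokens of one frame.
    Assumption: identity Q/K/V projections; part index used as the rotary position."""
    d = len(frame[0])
    qk = [rope(tok, pos) for pos, tok in enumerate(frame)]
    out = []
    for q in qk:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in qk]
        weights = softmax(scores)
        out.append([sum(w * tok[j] for w, tok in zip(weights, frame)) for j in range(d)])
    return out

def temporal_scan(track):
    """Causal, input-dependent linear recurrence over time for one body part.
    Assumption: a simplified stand-in for a selective state space model."""
    d = len(track[0])
    h = [0.0] * d
    out = []
    for x in track:
        delta = math.log1p(math.exp(sum(x) / d))  # softplus step size from the input
        decay = math.exp(-delta)                   # input-dependent forgetting factor
        h = [decay * hj + (1.0 - decay) * xj for hj, xj in zip(h, x)]
        out.append(list(h))
    return out

def st_layer(x):
    """x[t][p] is the feature vector of body part p at frame t."""
    T, P = len(x), len(x[0])
    # Spatial pass: mix parts within each frame, with a residual connection.
    spatial = []
    for t in range(T):
        attn = spatial_attention(x[t])
        spatial.append([[a + b for a, b in zip(x[t][p], attn[p])] for p in range(P)])
    # Temporal pass: scan each part's track over time, with a residual connection.
    out = [[None] * P for _ in range(T)]
    for p in range(P):
        track = [spatial[t][p] for t in range(T)]
        scanned = temporal_scan(track)
        for t in range(T):
            out[t][p] = [a + b for a, b in zip(track[t], scanned[t])]
    return out

# Toy input: 4 frames, 3 body parts, 4 features per part.
motion = [[[0.1 * (t + 1) * (p + 1) + 0.01 * j for j in range(4)]
           for p in range(3)] for t in range(4)]
result = st_layer(motion)
print(len(result), len(result[0]), len(result[0][0]))
```

The layer keeps the input shape (frames, parts, features). Attention cost in this arrangement grows with the square of the number of body parts rather than the sequence length, while the recurrence is linear in the number of frames, which is one plausible reason for pairing the two.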

Keywords: virtual digital human, full-body co-speech gesture generation, conditional diffusion model, Transformer, Mamba

CLC number: TP391.41

Zhang Shuozhe, Song Wenfeng, Hou Xia, et al. Full-body Co-speech Gesture Generation Based on Spatial-temporal Enhanced Generation Model[J]. Journal of System Simulation, 2026, 38(1): 211-224.

References

[1]	Liu Haiyang, Zhu Zihao, Iwamoto Naoya, et al. BEAT: A Large-scale Semantic and Emotional Multi-modal Dataset for Conversational Gestures Synthesis[C]//Computer Vision – ECCV 2022. Cham: Springer Nature Switzerland, 2022: 612-630.
[2]	Liu Haiyang, Zhu Zihao, Becherini Giorgio, et al. EMAGE: Towards Unified Holistic Co-speech Gesture Generation via Expressive Masked Audio Gesture Modeling[C]//2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2024: 1144-1154.
[3]	Qi Xingqun, Pan Jiahao, Li Peng, et al. Weakly-supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation[C]//2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2024: 10424-10434.
[4]	Xu Zunnan, Lin Yukang, Han Haonan, et al. MambaTalk: Efficient Holistic Gesture Synthesis with Selective State Space Models[C]//Proceedings of the 38th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2024: 20055-20080.
[5]	Alexanderson Simon, Gustav Eje Henter, Kucherenko Taras, et al. Style-controllable Speech-driven Gesture Synthesis Using Normalising Flows[J]. Computer Graphics Forum, 2020, 39(2): 487-496.
[6]	Chen Bohong, Li Yumeng, Ding Yaoxiang, et al. Enabling Synergistic Full-body Control in Prompt-based Co-speech Motion Generation[C]//Proceedings of the 32nd ACM International Conference on Multimedia. New York: ACM, 2024: 6774-6783.
[7]	Yang Sicheng, Wu Zhiyong, Li Minglei, et al. DiffuseStyleGesture: Stylized Audio-driven Co-speech Gesture Generation with Diffusion Models[C]//Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence. California: IJCAI, 2023: 5860-5868.
[8]	Gu A, Dao T. Mamba: Linear-time Sequence Modeling with Selective State Spaces[EB/OL]. (2024-05-31) [2025-04-05].
[9]	Dao T, Gu A. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality[C]//Proceedings of the 41st International Conference on Machine Learning. Chia Laguna Resort: PMLR, 2024: 10041-10071.
[10]	Liu Pinxin, Song Luchuan, Huang Junhua, et al. GestureLSM: Latent Shortcut Based Co-speech Gesture Generation with Spatial-temporal Modeling[EB/OL]. (2025-01-31) [2025-04-05].
[11]	Zhang Mingyuan, Li Huirong, Cai Zhongang, et al. FineMoGen: Fine-grained Spatio-temporal Motion Generation and Editing[C]//Proceedings of the 37th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2023: 13981-13992.
[12]	Liu Xian, Wu Qianyi, Zhou Hang, et al. Learning Hierarchical Cross-modal Association for Co-speech Gesture Generation[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 10452-10462.
[13]	Ao Tenglong, Gao Qingzhe, Lou Yuke, et al. Rhythmic Gesticulator: Rhythm-aware Co-speech Gesture Synthesis with Hierarchical Neural Embeddings[J]. ACM Transactions on Graphics, 2022, 41(6): 209.
[14]	Yi Hongwei, Liang Hualin, Liu Yifei, et al. Generating Holistic 3D Human Motion from Speech[C]//2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2023: 469-480.
[15]	Hamza Mughal M, Dabral Rishabh, C J Scholman Merel, et al. Retrieving Semantics from the Deep: An RAG Solution for Gesture Synthesis[C]//2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2025: 16578-16588.
[16]	Chen Changan, Zhang Juze, Lakshmikanth S K, et al. The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion[EB/OL]. (2024-12-13) [2025-04-05].
[17]	Frans K, Hafner D, Levine S, et al. One Step Diffusion Via Shortcut Models[EB/OL]. (2024-10-16) [2025-08-12].
[18]	Lenz B, Lieber O, Arazi A, et al. Jamba: Hybrid Transformer-Mamba Language Models[C]//ICLR 2025 Conference. New York: ICLR, 2025: 1-26.
[19]	Wang Junxiong, Paliotta Daniele, May A, et al. The Mamba in the Llama: Distilling and Accelerating Hybrid Models[C]//Proceedings of the 38th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2024: 62432-62457.
[20]	Touvron H, Lavril T, Izacard G, et al. LLaMA: Open and Efficient Foundation Language Models[EB/OL]. (2023-02-27) [2025-04-05].
[21]	Zhu Lianghui, Liao Bencheng, Zhang Qian, et al. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model[C]//Proceedings of the 41st International Conference on Machine Learning. Chia Laguna Resort: PMLR, 2024: 62429-62442.
[22]	Liu Yue, Tian Yunjie, Zhao Yuzhong, et al. VMamba: Visual State Space Model[C]//Proceedings of the 38th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2024: 103031-103063.
[23]	Tao Hu Vincent, Stefan Andreas Baumann, Gui Ming, et al. ZigMa: A DiT-style Zigzag Mamba Diffusion Model[C]//Computer Vision – ECCV 2024. Cham: Springer Nature Switzerland, 2025: 148-166.
[24]	Peebles W, Xie Saining. Scalable Diffusion Models with Transformers[C]//2023 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2023: 4172-4182.
[25]	Shen Qiuhong, Wu Zike, Yi Xuanyu, et al. Gamba: Marry Gaussian Splatting with Mamba for Single View 3D Reconstruction[EB/OL]. (2024-05-24) [2025-04-05].
[26]	Kerbl Bernhard, Kopanas Georgios, Leimkuehler Thomas, et al. 3D Gaussian Splatting for Real-time Radiance Field Rendering[J]. ACM Transactions on Graphics, 2023, 42(4): 139.
[27]	Zhang Zeyu, Liu Akide, Reid Ian, et al. Motion Mamba: Efficient and Long Sequence Motion Generation[C]//Computer Vision – ECCV 2024. Cham: Springer Nature Switzerland, 2025: 265-282.
[28]	Fu Chencan, Wang Yabiao, Zhang Jiangning, et al. MambaGesture: Enhancing Co-speech Gesture Generation with Mamba and Disentangled Multi-modality Fusion[C]//Proceedings of the 32nd ACM International Conference on Multimedia. New York: ACM, 2024: 10794-10803.
[29]	Rombach Robin, Blattmann Andreas, Lorenz Dominik, et al. High-resolution Image Synthesis with Latent Diffusion Models[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 10674-10685.
[30]	Lee S, Hoover B, Strobelt H, et al. Diffusion Explainer: Visual Explanation for Text-to-image Stable Diffusion[C]//2024 IEEE Visualization and Visual Analytics (VIS). Piscataway: IEEE, 2024: 96-100.
[31]	Lin Xinyi, Wu Hongjia, Yuan Zhiting, et al. Computer Aided Analysis of Ancient Painting Seals Based on Image Extraction and Restoration[J]. Journal of Computer-Aided Design & Computer Graphics, 2025, 37(2): 254-264.
[32]	Kim Taehoon, Kang ChanHee, Park JaeHyuk, et al. Human Motion Aware Text-to-video Generation with Explicit Camera Control[C]//2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). Piscataway: IEEE, 2024: 5069-5078.
[33]	Tevet G, Raab S, Gordon B, et al. Human Motion Diffusion Model[C]//ICLR 2023 Conference. New York: ICLR, 2023: 1-16.
[34]	Chen Xin, Jiang Biao, Liu Wen, et al. Executing Your Commands via Motion Diffusion in Latent Space[C]//2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2023: 18000-18010.
[35]	Xie Yiming, Jampani V, Zhong Lei, et al. OmniControl: Control Any Joint at Any Time for Human Motion Generation[C]//ICLR 2024 Conference. New York: ICLR, 2024: 1-19.
[36]	Zhang Lümin, Rao Anyi, Agrawala M. Adding Conditional Control to Text-to-image Diffusion Models[C]//2023 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2023: 3813-3824.
[37]	Zhou Yanqi, Lei Tao, Liu Hanxiao, et al. Mixture-of-experts with Expert Choice Routing[EB/OL]. (2022-10-14) [2025-04-05].
[38]	Tseng J, Castellon R, Liu C K. EDGE: Editable Dance Generation from Music[C]//2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2023: 448-458.
[39]	Shi Min, Sun Bilian, Zhu Dengming, et al. KM 2D: Method for Generating Dance Animation Driven by Dance Movement Primitives and Musical Semantics[J/OL]. Journal of Computer-Aided Design & Computer Graphics. (2025-03-15) [2025-08-12].
[40]	Li Chenguang, Wen Yuhui, Jing Yuchen, et al. Shape-aware Stylized Dance Motion Generation Driven by Music[J/OL]. Journal of Computer-Aided Design & Computer Graphics. (2025-02-07) [2025-08-12].
[41]	Ao Tenglong, Zhang Zeyi, Liu Libin. GestureDiffuCLIP: Gesture Diffusion Model with CLIP Latents[J]. ACM Transactions on Graphics, 2023, 42(4): 42.
[42]	Radford A, Kim J W, Hallacy C, et al. Learning Transferable Visual Models from Natural Language Supervision[C]//Proceedings of the 38th International Conference on Machine Learning. Chia Laguna Resort: PMLR, 2021: 8748-8763.
[43]	Chen Junming, Liu Yunfei, Wang Jianan, et al. DiffSHEG: A Diffusion-based Approach for Real-time Speech-driven Holistic 3D Expression and Gesture Generation[C]//2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2024: 7352-7361.
[44]	Wang Su, Liu Yuelin, Sun Li. An Intelligent Generative Design Method for Product Styling Driven by Visual Perception Data[J/OL]. Journal of Computer-Aided Design & Computer Graphics. (2025-02-17) [2025-08-12].

Method	FGD↓	BC→	Div.↑	MSE↓	AITS↓
CaMN^[1]	6.64	6.763	10.861	—	0.132
DiffuseStyleGesture^[7]	8.81	7.335	11.492	—	—
EMAGE^[2]	5.43	7.853	13.121	7.911	0.122
SynTalker^[6]	4.65	7.272	12.662	—	2.262
MambaTalk^[4]	5.36	7.932	13.030	6.845	0.065
GestureLSM^[10]	4.08	7.201	13.243	10.017	0.039
Ours	3.94	7.263	13.782	8.092	0.042
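As a quick reading aid for the comparison table above, the following sketch computes the relative change of the proposed method (last row) against GestureLSM, the strongest baseline on FGD. The values are copied from the table; the helper name `relative_change` is only illustrative.

```python
# Rows copied from the comparison table above (FGD and AITS: lower is better; Div.: higher is better).
rows = {
    "GestureLSM": {"FGD": 4.08, "Div": 13.243, "AITS": 0.039},
    "Ours": {"FGD": 3.94, "Div": 13.782, "AITS": 0.042},
}

def relative_change(new, old):
    """Percentage change of `new` relative to `old`."""
    return (new - old) / old * 100.0

for metric in ("FGD", "Div", "AITS"):
    change = relative_change(rows["Ours"][metric], rows["GestureLSM"][metric])
    print(f"{metric}: {change:+.2f}%")
```

On these numbers the proposed method has about 3.4% lower FGD and 4.1% higher diversity than GestureLSM, at about 7.7% higher AITS, which is consistent with the abstract's claim of better fidelity and diversity at a comparable inference speed.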

Method	FGD↓	BC→	Div.↑
Ours	3.94	7.263	13.782
w SMTA	4.26	7.453	14.121
w SMTM	4.55	7.472	13.962
w SATA	4.10	7.324	13.031
w/o ST	4.67	7.665	12.868

Method	FGD↓	BC→	Div.↑
Ours	3.94	7.203	13.782
w MLP	4.68	7.565	12.898
w Gamba^[25]	3.97	7.645	13.448