Journal of System Simulation ›› 2026, Vol. 38 ›› Issue (1): 211-224. doi: 10.16182/j.issn1004731x.joss.25-0833

• Paper •

Full-body Co-speech Gesture Generation Based on Spatial-temporal Enhanced Generation Model

Zhang Shuozhe1, Song Wenfeng1, Hou Xia1, Li Shuai2,3

  1. Beijing Information Science & Technology University, Beijing 102206, China
    2. National Key Laboratory of Virtual Reality Technology and System, Beihang University, Beijing 100191, China
    3. Zhongguancun National Laboratory, Beijing 100194, China
  • Received: 2025-09-02  Revised: 2025-12-12  Online: 2026-01-18  Published: 2026-01-28
  • Corresponding author: Song Wenfeng
  • First author: Zhang Shuozhe (2001-), male, master's student; research interests: computer vision and digital humans.
  • Funding:
    National Natural Science Foundation of China (62572062); National Natural Science Foundation of China (62525204); Beijing Natural Science Foundation (L232102)



Abstract:

Full-body co-speech gesture generation significantly enhances the interactivity of virtual digital humans: the generated gestures must not only align accurately with speech but also exhibit realistic full-body dynamics. Existing Transformer-based approaches often overlook the temporal features of motion sequences, while diffusion-based approaches inadequately capture the spatial correlations between body parts. To address these limitations, a full-body motion generation method integrating diffusion models, Mamba, and attention mechanisms is proposed. A spatial self-attention-temporal state space model (STMamba Layer) is introduced as the core module of the denoising network to extract inter-part spatial features and intra-part temporal features, thereby improving the quality and diversity of the generated motion. Body motion sequences are modeled along two dimensions: spatially, rotary relative positional encoding and self-attention capture the spatial correlations among body joints; temporally, Mamba captures the per-part dynamics of the motion sequence to enhance continuity. Experiments and evaluations on the large-scale audio-text-motion dataset BEAT2 show that the proposed method outperforms state-of-the-art approaches in both fidelity and diversity while maintaining competitive inference speed.
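The two-dimensional factorization described above (per-frame spatial self-attention with rotary relative positional encoding over joints, followed by a per-joint recurrence over time) can be sketched schematically. This is a minimal illustration with random inputs, untied projections, and a fixed-decay diagonal recurrence standing in for Mamba's selective scan; the function names, shapes, and the residual connection are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def rotary_encode(x, base=10000.0):
    # Rotate channel pairs by position-dependent angles so that dot products
    # between encoded vectors depend on their relative positions (joint indices).
    n, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)          # (half,)
    angles = np.arange(n)[:, None] * freqs[None, :]    # (n, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def spatial_self_attention(x):
    # x: (J, D) joint features of one frame. Queries/keys carry the rotary
    # encoding, so attention weights reflect relative joint positions.
    q, k = rotary_encode(x), rotary_encode(x)
    scores = q @ k.T / np.sqrt(x.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def temporal_ssm(x, decay=0.9):
    # x: (T, D) one joint's features over time. A diagonal linear state-space
    # recurrence h_t = decay * h_{t-1} + x_t; a crude stand-in for Mamba's
    # input-dependent selective scan.
    h = np.zeros(x.shape[-1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = decay * h + x[t]
        out[t] = h
    return out

def stmamba_layer(motion):
    # motion: (T, J, D) = frames x joints x feature dim.
    T, J, _ = motion.shape
    spatial = np.stack([spatial_self_attention(motion[t]) for t in range(T)])
    temporal = np.stack([temporal_ssm(spatial[:, j]) for j in range(J)], axis=1)
    return motion + temporal  # residual connection

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 5, 4))   # 8 frames, 5 joints, 4-dim features
y = stmamba_layer(x)
print(y.shape)                       # (8, 5, 4)
```

In the full model such a layer would sit inside the diffusion denoising network and be stacked several times; here it only demonstrates the spatial-then-temporal feature flow.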

Key words: virtual digital human, full-body co-speech gesture generation, conditional diffusion model, Transformer, Mamba

CLC number: