Journal of System Simulation ›› 2026, Vol. 38 ›› Issue (1): 211-224. doi: 10.16182/j.issn1004731x.joss.25-0833

• Papers

Full-body Co-speech Gesture Generation Based on Spatial-temporal Enhanced Generation Model

Zhang Shuozhe1, Song Wenfeng1, Hou Xia1, Li Shuai2,3   

  1. Beijing Information Science & Technology University, Beijing 102206, China
  2. National Key Laboratory of Virtual Reality Technology and System, Beihang University, Beijing 100191, China
  3. Zhongguancun National Laboratory, Beijing 100194, China
  • Received: 2025-09-02 Revised: 2025-12-12 Online: 2026-01-18 Published: 2026-01-28
  • Contact: Song Wenfeng

Abstract:

Full-body co-speech gesture generation significantly enhances the interactivity of virtual digital humans, and it requires generated gestures both to align accurately with speech and to exhibit realistic full-body dynamics. Existing methods have two limitations: Transformer-based approaches often overlook the temporal features of action sequences, while diffusion-based approaches inadequately capture spatial correlations between body parts. To address these limitations, a full-body action generation method integrating diffusion models, Mamba, and attention mechanisms is proposed. We introduce a spatial self-attention and temporal state space model layer (STMamba Layer) as the core of the denoising network to extract inter-part spatial features and intra-part temporal features, thereby enhancing action quality and diversity. Body motion sequences are modeled along two dimensions: spatially, rotational relative positional encoding and self-attention capture correlations among body joints; temporally, Mamba captures intra-part dynamics in action sequences to improve continuity. Experiments on the large-scale audio-text-action dataset BEAT2 demonstrate that the proposed method outperforms state-of-the-art approaches in both fidelity and diversity while maintaining competitive inference speed.
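
The two-dimensional modeling described in the abstract can be illustrated with a toy sketch. It is not the authors' STMamba Layer: the function names (`spatial_self_attention`, `temporal_ssm`, `st_block`) and the constants `a` and `b` are hypothetical, the attention uses identity projections instead of learned ones, the positional encoding and the diffusion denoising loop are omitted, and a fixed linear recurrence stands in for Mamba's input-dependent state space model. It only shows the order of operations: mix features across body parts within each frame, then propagate each part's features along time.

```python
import math

def spatial_self_attention(frame):
    """Mix features across body parts within a single frame.

    frame: list of P part vectors, each a list of D floats.
    Assumption: identity Q/K/V projections (a real layer learns them).
    """
    d = len(frame[0])
    out = []
    for q in frame:
        # Scaled dot-product scores of this part against every part.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in frame]
        # Numerically stable softmax over the parts.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Attention output: weighted average of the part vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, frame))
                    for j in range(d)])
    return out

def temporal_ssm(series, a=0.9, b=0.1):
    """Run a fixed linear state-space recurrence along time for one part.

    series: list of T vectors (D floats each).
    h_t = a * h_{t-1} + b * x_t, and the output is y_t = h_t.
    Assumption: constant a and b; Mamba makes its parameters input-dependent.
    """
    h = [0.0] * len(series[0])
    out = []
    for x in series:
        h = [a * hj + b * xj for hj, xj in zip(h, x)]
        out.append(list(h))
    return out

def st_block(motion):
    """Apply the spatial step, then the temporal step.

    motion[t][p] is the D-dim feature of body part p at frame t.
    """
    num_frames, num_parts = len(motion), len(motion[0])
    # Spatial step: attention across parts, frame by frame.
    spatial = [spatial_self_attention(frame) for frame in motion]
    # Temporal step: recurrence across frames, part by part.
    per_part = [temporal_ssm([spatial[t][p] for t in range(num_frames)])
                for p in range(num_parts)]
    # Restore the [t][p] layout.
    return [[per_part[p][t] for p in range(num_parts)]
            for t in range(num_frames)]
```

Calling `st_block` on a nested list of shape frames × parts × features returns a list of the same shape, in which each output frame depends on all parts of that frame and on every earlier frame of the same part.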

Key words: human avatar, full-body co-speech gesture generation, conditional diffusion model, Transformer, Mamba
