Journal of System Simulation, 2026, Vol. 38, Issue (1): 136-157. doi: 10.16182/j.issn1004731x.joss.25-0832

• Papers •

Diffusion Model for Human Motion Generation with Fine-grained Text and Spatial Control Signals

Jiang Binze1, Song Wenfeng1, Hou Xia1, Li Shuai2,3   

  1. Beijing Information Science & Technology University, Beijing 102206, China
  2. National Key Laboratory of Virtual Reality Technology and System, Beihang University, Beijing 100191, China
  3. Zhongguancun National Laboratory, Beijing 100194, China
  • Received: 2025-09-02  Revised: 2025-10-11  Online: 2026-01-18  Published: 2026-01-28
  • Contact: Song Wenfeng

Abstract:

To improve the accuracy, controllability, and realism of text-driven human motion generation, a novel method is proposed that integrates fine-grained textual semantics with spatial control signals. Within the diffusion model framework, both global text tokens and body-part-level local tokens are introduced. These are encoded with CLIP to obtain the corresponding text features, which are fed into the motion diffusion model to enable fine-grained control over individual body parts. Spatial guidance dynamically adjusts joint positions during the diffusion denoising process so that the generated motion satisfies the given spatial constraints, while realism guidance enhances the naturalness and overall coordination of the uncontrolled joints. For the experiments on the HumanML3D dataset, 44 970 text samples were rewritten at a fine-grained level with ChatGPT-4o to improve the semantic alignment between text and motion. Results demonstrate that the proposed method outperforms existing approaches in motion semantic consistency, spatial control accuracy, and generation quality, producing human motions that meet user expectations in both semantic alignment and motion quality.
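The dual-granularity text conditioning described above can be illustrated with a minimal sketch using OpenAI's CLIP package: one global sentence and several body-part-level phrases are tokenized and encoded separately, yielding the global and local text features that condition the diffusion model. The specific part decomposition and phrase format here are hypothetical illustrations, not the paper's exact scheme.

```python
# Minimal sketch of dual-granularity CLIP text encoding, assuming
# OpenAI's CLIP package (pip install clip from openai/CLIP).
# The body-part phrases below are hypothetical examples.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def encode_text_features(global_text, part_texts):
    """Encode one global description and per-body-part descriptions.

    global_text: str, e.g. "a person walks forward while waving"
    part_texts:  list[str], one fine-grained phrase per body part
    Returns (global_feat, part_feats) as CLIP text embeddings.
    """
    with torch.no_grad():
        global_tok = clip.tokenize([global_text]).to(device)   # (1, 77)
        part_tok = clip.tokenize(part_texts).to(device)        # (P, 77)
        global_feat = model.encode_text(global_tok)            # (1, 512)
        part_feats = model.encode_text(part_tok)               # (P, 512)
    return global_feat, part_feats

g, p = encode_text_features(
    "a person walks forward while waving",
    ["the legs step forward steadily",
     "the right arm raises and waves",
     "the left arm swings naturally"],
)
```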
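The spatial-guidance step can likewise be sketched as a gradient nudge applied to the noisy motion at each denoising iteration, pushing the controlled joints toward their target positions. This is a hedged reconstruction under stated assumptions: `denoiser` and `joints_from_motion` are hypothetical callables standing in for the paper's motion diffusion model and its joint-recovery transform, and the guidance scale and schedule are illustrative, not the paper's values.

```python
# Hedged sketch of spatial guidance during denoising. `denoiser` and
# `joints_from_motion` are hypothetical placeholders; the guidance
# scale is an assumption for illustration only.
import torch

def spatial_guidance_step(denoiser, joints_from_motion,
                          x_t, t, text_feat, targets, mask, scale=30.0):
    """Nudge the noisy motion x_t so controlled joints approach targets.

    x_t:       (B, T, D)    noisy motion representation at timestep t
    text_feat: (B, 512)     CLIP text features conditioning the denoiser
    targets:   (B, T, J, 3) desired joint positions
    mask:      (B, T, J, 1) 1 where a joint is spatially constrained
    Returns the guided noisy motion for the usual sampling step.
    """
    x_t = x_t.detach().requires_grad_(True)
    x0_pred = denoiser(x_t, t, text_feat)      # predicted clean motion
    joints = joints_from_motion(x0_pred)       # (B, T, J, 3)
    # Penalize only the constrained joints; uncontrolled joints are
    # masked out and left to the realism guidance.
    loss = (((joints - targets) ** 2) * mask).sum()
    grad, = torch.autograd.grad(loss, x_t)
    return (x_t - scale * grad).detach()
```

A realism term over the uncontrolled joints could in principle be added analogously, though the abstract does not specify how the paper's realism guidance is implemented.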

Key words: human motion, fine-grained text, multimodal fusion, diffusion models, spatial control signals
