Journal of System Simulation ›› 2026, Vol. 38 ›› Issue (1): 136-157. doi: 10.16182/j.issn1004731x.joss.25-0832

• Papers •

Diffusion Model for Human Motion Generation with Fine-grained Text and Spatial Control Signals

Jiang Binze1, Song Wenfeng1, Hou Xia1, Li Shuai2,3

  1. Beijing Information Science & Technology University, Beijing 102206, China
  2. National Key Laboratory of Virtual Reality Technology and System, Beihang University, Beijing 100191, China
  3. Zhongguancun National Laboratory, Beijing 100194, China
  • Received: 2025-09-02 Revised: 2025-10-11 Online: 2026-01-18 Published: 2026-01-28
  • Corresponding author: Song Wenfeng
  • About the first author: Jiang Binze (2001-), male, master's student; research interests: computer vision and condition-driven human motion generation.
  • Funding:
    National Natural Science Foundation of China (62572062); National Natural Science Foundation of China (62525204); Beijing Natural Science Foundation (L232102)

Abstract:

To improve the accuracy, controllability, and realism of text-driven human motion generation, a method is proposed that integrates fine-grained textual semantics with spatial control signals. Within a diffusion model framework, global text tokens and body-part-level local tokens are introduced and encoded with CLIP; the resulting features are fed into the motion diffusion model to enable fine-grained control over individual body parts. Spatial guidance dynamically adjusts joint positions during diffusion denoising so that the generated motion satisfies the spatial constraints, and realism guidance improves the naturalness of the uncontrolled joints and the overall coordination of the motion. Experiments are conducted on the HumanML3D dataset, with 44,970 text descriptions rewritten at fine granularity using ChatGPT-4o to improve the semantic alignment between text and motion. Results show that the proposed method outperforms existing methods in semantic consistency, spatial control accuracy, and generation quality, and produces human motions that meet user expectations in both semantic alignment and motion quality.

Key words: human motion, fine-grained text, multimodal fusion, diffusion models, spatial control signals
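
The spatial guidance step mentioned in the abstract can be illustrated with a minimal sketch. It assumes, as an illustration only and not as the authors' implementation, that guidance amounts to a few gradient-descent steps on a squared-error loss between the controlled joint positions and their targets, applied to the motion predicted at a single denoising step. The function name `spatial_guidance`, the step size `lr`, and the iteration count `steps` are hypothetical.

```python
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]
Motion = List[List[Vec3]]  # motion[frame][joint] -> (x, y, z)


def spatial_guidance(
    motion: Motion,
    targets: Dict[Tuple[int, int], Vec3],
    lr: float = 0.25,
    steps: int = 10,
) -> Motion:
    """Pull controlled joints toward target positions by gradient descent.

    Hypothetical illustration: minimizes the sum of ||p - target||^2 over the
    (frame, joint) pairs listed in `targets`; all other joints are untouched.
    """
    # Copy each frame so the caller's motion is not mutated.
    guided = [list(frame) for frame in motion]
    for _ in range(steps):
        for (f, j), tgt in targets.items():
            p = guided[f][j]
            # Gradient of ||p - tgt||^2 with respect to p is 2 * (p - tgt).
            guided[f][j] = tuple(
                pc - lr * 2.0 * (pc - tc) for pc, tc in zip(p, tgt)
            )
    return guided


# Two frames, two joints, everything at the origin.
motion = [[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)] for _ in range(2)]
# Constrain joint 0 in frame 1 to reach x = 1.
guided = spatial_guidance(motion, {(1, 0): (1.0, 0.0, 0.0)})
```

With `lr = 0.25` each step halves the remaining error, so the constrained joint ends up close to its target while every unconstrained joint keeps its predicted position. In a full sampler such a correction would be repeated across denoising steps; the realism guidance that the abstract pairs with it is not modeled here.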
