Journal of System Simulation ›› 2024, Vol. 36 ›› Issue (5): 1061-1071. doi: 10.16182/j.issn1004731x.joss.23-0017

• Research Paper •


Dense Video Description Method Based on Multi-modal Fusion in Transformer Network

Li Xiang, Sang Haifeng

  1. School of Information Science and Engineering, Shenyang University of Technology, Shenyang 110870, China
  • Received: 2023-01-04  Revised: 2023-03-24  Online: 2024-05-15  Published: 2024-05-21
  • Corresponding author: Sang Haifeng  E-mail: lixiang3278@163.com; sanghaif@163.com
  • First author: Li Xiang (1999-), female, master's student; research interest: video description. E-mail: lixiang3278@163.com
  • Funding: National Natural Science Foundation of China (62173078); Natural Science Foundation of Liaoning Province (2022-MS-268)


Abstract:

Most current dense video description models adopt two-stage methods, which are inefficient, ignore audio and semantic information, and produce incomplete descriptions. To address these problems, a dense video description method based on multi-modal and semantic information fusion in a Transformer network was proposed. An adaptive R(2+1)D network was proposed to extract visual features, a semantic detector was designed to generate semantic information, and audio features were added as a complement; a multi-scale deformable attention module was established, and parallel prediction heads were applied to accelerate convergence and improve the accuracy of the model. The experimental results show that the model performs well on both benchmark datasets, reaching 2.17 on the BLEU4 metric.

Key words: dense event description, Transformer network, semantic information, multi-modal fusion, deformable attention

CLC number:
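
To give a rough sense of the multi-modal fusion and parallel prediction-head design the abstract describes, the PyTorch sketch below is an illustrative approximation under our own assumptions: the module names, feature dimensions, plain concatenation-based fusion, and standard Transformer layers are ours, and the paper's adaptive R(2+1)D backbone, semantic detector, and multi-scale deformable attention are not reproduced. It only shows the general pattern of fusing visual, audio, and semantic features in a shared encoder and attaching parallel heads for event localization and captioning.

# Illustrative sketch only; not the authors' implementation.
# Fuses visual, audio, and semantic features with a standard Transformer
# encoder-decoder and parallel prediction heads.
import torch
import torch.nn as nn


class MultiModalDenseCaptioner(nn.Module):
    def __init__(self, vis_dim=512, aud_dim=128, sem_dim=300,
                 d_model=512, num_layers=2, num_events=10, vocab_size=10000):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.aud_proj = nn.Linear(aud_dim, d_model)
        self.sem_proj = nn.Linear(sem_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        # Learned event queries, in the spirit of set-prediction detectors (an assumption).
        self.event_queries = nn.Parameter(torch.randn(num_events, d_model))
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers)
        # Parallel prediction heads: event boundaries and caption word logits.
        self.loc_head = nn.Linear(d_model, 2)           # (start, end) per event
        self.cap_head = nn.Linear(d_model, vocab_size)  # placeholder for a full caption decoder

    def forward(self, vis_feat, aud_feat, sem_feat):
        # vis_feat: (B, T, vis_dim), aud_feat: (B, T, aud_dim), sem_feat: (B, T, sem_dim)
        fused = torch.cat([self.vis_proj(vis_feat),
                           self.aud_proj(aud_feat),
                           self.sem_proj(sem_feat)], dim=1)  # concatenate along the time axis
        memory = self.encoder(fused)
        queries = self.event_queries.unsqueeze(0).expand(vis_feat.size(0), -1, -1)
        events = self.decoder(queries, memory)
        return self.loc_head(events), self.cap_head(events)


# Usage with random features for a batch of 2 clips, 30 snippets each.
model = MultiModalDenseCaptioner()
loc, cap = model(torch.randn(2, 30, 512), torch.randn(2, 30, 128), torch.randn(2, 30, 300))
print(loc.shape, cap.shape)  # torch.Size([2, 10, 2]) torch.Size([2, 10, 10000])

In this sketch each event query yields both a temporal boundary and caption logits in parallel, mirroring the single-stage, parallel-head idea of the abstract; the actual model replaces the vanilla attention with multi-scale deformable attention to reduce computation over long feature sequences.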