Journal of System Simulation ›› 2024, Vol. 36 ›› Issue (5): 1061-1071. doi: 10.16182/j.issn1004731x.joss.23-0017

Dense Video Description Method Based on Multi-modal Fusion in Transformer Network

Li Xiang, Sang Haifeng

  1. School of Information Science and Engineering, Shenyang University of Technology, Shenyang 110870, China
  • Received: 2023-01-04 Revised: 2023-03-24 Online: 2024-05-15 Published: 2024-05-21
  • Contact: Sang Haifeng


Most current dense video description models use two-stage methods, which are inefficient, ignore audio and semantic information, and produce incomplete descriptions. To address these problems, a dense video description method that fuses multi-modal and semantic information is proposed. An adaptive R(2+1)D network extracts visual features, a semantic detector generates semantic information, and audio features are added as a supplement. A multi-scale deformable attention module is established, and a parallel prediction head is applied to accelerate convergence and improve the accuracy of the model. Experimental results show that the model performs well on two benchmark datasets, with the BLEU4 evaluation metric reaching 2.17.
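
As background for the visual feature extractor named in the abstract, the following minimal sketch illustrates the channel arithmetic of a standard R(2+1)D block, which factorizes a full t x d x d 3D convolution into a 1 x d x d spatial convolution followed by a t x 1 x 1 temporal convolution. It covers only the generic factorization, not the adaptive variant proposed in this paper, whose details are not given in the abstract. The function names and the example channel sizes are illustrative assumptions, and bias terms are ignored.

```python
def conv3d_params(c_in: int, c_out: int, t: int = 3, d: int = 3) -> int:
    """Weight count of a full t x d x d 3D convolution (biases ignored)."""
    return t * d * d * c_in * c_out


def r2plus1d_mid_channels(c_in: int, c_out: int, t: int = 3, d: int = 3) -> int:
    """Intermediate channel count M of a standard R(2+1)D block.

    M is chosen so that the factorized block has roughly the same number
    of weights as the full 3D convolution it replaces.
    """
    return (t * d * d * c_in * c_out) // (d * d * c_in + t * c_out)


def r2plus1d_params(c_in: int, c_out: int, t: int = 3, d: int = 3) -> int:
    """Weight count of the factorized (2+1)D block (biases ignored)."""
    m = r2plus1d_mid_channels(c_in, c_out, t, d)
    spatial = d * d * c_in * m   # 1 x d x d convolution: c_in -> M channels
    temporal = t * m * c_out     # t x 1 x 1 convolution: M -> c_out channels
    return spatial + temporal


if __name__ == "__main__":
    # Hypothetical layer sizes, chosen only for illustration.
    for c_in, c_out in [(64, 64), (3, 64)]:
        print(c_in, c_out,
              r2plus1d_mid_channels(c_in, c_out),
              conv3d_params(c_in, c_out),
              r2plus1d_params(c_in, c_out))
```

Because M is rounded down, the factorized block never has more weights than the full 3D convolution, while the extra nonlinearity between the spatial and temporal convolutions is what distinguishes it from a plain 3D layer.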

Key words: dense event description, Transformer network, semantic information, multi-modal fusion, deformable attention
