Dense Video Description Method Based on Multi-modal Fusion in Transformer Network

doi:10.16182/j.issn1004731x.joss.23-0017

Abstract

Most current dense video description models rely on two-stage methods, which are inefficient, ignore audio and semantic information, and produce incomplete descriptions. To address these problems, a dense video description method based on multi-modal and semantic information fusion in a Transformer network is proposed. An adaptive R(2+1)D network is proposed to extract visual features, a semantic detector is designed to generate semantic information, and audio features are added as a supplement. A multi-scale deformable attention module is established, and parallel prediction heads are applied to accelerate model convergence and improve model accuracy. Experimental results show that the model performs well on two benchmark datasets and reaches 2.17 on the BLEU4 evaluation metric.

Key words: dense event description, Transformer network, semantic information, multi-modal fusion, deformable attention

CLC number: TP391

Li Xiang, Sang Haifeng. Dense Video Description Method Based on Multi-modal Fusion in Transformer Network[J]. Journal of System Simulation, 2024, 36(5): 1061-1071.

References

1	Venugopalan S, Rohrbach M, Donahue J, et al. Sequence to Sequence-video to Text[C]//2015 IEEE International Conference on Computer Vision (ICCV). Piscataway, NJ, USA: IEEE, 2015: 4534-4542.
2	Krishna R, Hata K, Ren F, et al. Dense-captioning Events in Videos[C]//2017 IEEE International Conference on Computer Vision (ICCV). Piscataway, NJ, USA: IEEE, 2017: 706-715.
3	Duan Xuguang, Huang Wenbing, Gan Chuang, et al. Weakly Supervised Dense Event Captioning in Videos[EB/OL]. (2018-12-10) [2022-07-12].
4	Jiao Yifan, Li Zhetao, Huang Shucheng, et al. Three-dimensional Attention-based Deep Ranking Model for Video Highlight Detection[J]. IEEE Transactions on Multimedia, 2018, 20(10): 2693-2705.
5	Ning Ke, Cai Ming, Xie Di, et al. An Attentive Sequence to Sequence Translator for Localizing Video Clips by Natural Language[J]. IEEE Transactions on Multimedia, 2020, 22(9): 2434-2443.
6	Vaswani A, Shazeer N, Parmar N, et al. Attention is All You Need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates Inc., 2017: 6000-6010.
7	Yu Zhou, Han Nanjia. Accelerated Masked Transformer for Dense Video Captioning[J]. Neurocomputing, 2021, 445: 72-80.
8	Iashin Vladimir, Rahtu Esa. A Better Use of Audio-visual Cues: Dense Video Captioning with Bi-modal Transformer[C]//The 31st British Machine Vision Conference. Durham: BMVC, 2020: 111.
9	Chang Zhi, Zhao Dexin, Chen Huilin, et al. Event-centric Multi-modal Fusion Method for Dense Video Captioning[J]. Neural Networks, 2022, 146: 120-129.
10	Xu Yuecong, Yang Jianfei, Mao Kezhi. Semantic-filtered Soft-split-aware Video Captioning with Audio-augmented Feature[J]. Neurocomputing, 2019, 357: 24-35.
11	Wu Chunlei, Wei Yiwei, Chu Xiaoliang, et al. Hierarchical Attention-based Multimodal Fusion for Video Captioning[J]. Neurocomputing, 2018, 315: 362-370.
12	Lee Sujin, Kim Incheol. Learning Semantic Features for Dense Video Captioning[J]. Journal of KIISE, 2019, 46(8): 753-762.
13	Wang Teng, Zheng Huicheng, Yu Mingjing, et al. Event-centric Hierarchical Representation for Dense Video Captioning[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(5): 1890-1900.
14	Zhang Zhiwang, Xu Dong, Ouyang Wanli, et al. Show, Tell and Summarize: Dense Video Captioning Using Visual Cue Aided Sentence Summarization[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2020, 30(9): 3130-3139.
15	Wang Teng, Zhang Ruimao, Lu Zhichao, et al. End-to-end Dense Video Captioning with Parallel Decoding[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway, NJ, USA: IEEE, 2021: 6827-6837.
16	Banerjee S, Lavie A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments[C]//Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Stroudsburg, PA, USA: ACL, 2005: 65-72.
17	Vedantam R, Zitnick C L, Parikh D. CIDEr: Consensus-based Image Description Evaluation[C]//2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ, USA: IEEE, 2015: 4566-4575.
18	Fujita Soichiro, Hirao Tsutomu, Kamigaito Hidetaka, et al. SODA: Story Oriented Dense Video Captioning Evaluation Framework[C]//Computer Vision-ECCV 2020. Cham: Springer International Publishing, 2020: 517-531.
19	Dai Zihang, Yang Zhilin, Yang Yiming, et al. Transformer-XL: Attentive Language Models Beyond a Fixed-length Context[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA, USA: ACL, 2019: 2978-2988.
20	Ryu Hobin, Kang Sunghun, Kang Haeyong, et al. Semantic Grouping Network for Video Captioning[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(3): 2514-2522.
21	Gabeur Valentin, Sun Chen, Alahari Karteek, et al. Multi-modal Transformer for Video Retrieval[C]//Computer Vision-ECCV 2020. Cham: Springer International Publishing, 2020: 214-229.
22	Lei Jie, Wang Liwei, Shen Yelong, et al. MART: Memory-augmented Recurrent Transformer for Coherent Video Paragraph Captioning[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA, USA: ACL, 2020: 2603-2614.

Model	Bleu_4	METEOR	CIDEr	SODA_c
Transformer-XL[19]	1.93	10.03	40.32	5.21
SGN[20]	1.75	9.43	40.33	—
BMT[8]	1.99	8.78	39.12	5.45
MMT[21]	1.51	8.62	40.02	—
MART[22]	1.93	8.93	41.32	5.66
PDVC[15]	2.07	9.34	42.05	6.11
MSTVC	2.17	9.03	41.14	6.05
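
The abstract states that the model performs well and reaches 2.17 on BLEU4. To make the comparison above easier to read metric by metric, the following is a minimal sketch that copies the table's rows and reports the highest value per metric and the percentage change of MSTVC relative to PDVC. The names `SCORES`, `best_per_metric`, and `relative_change` are illustrative and not from the paper; `None` stands for a value the table leaves blank.

```python
# Scores copied from the comparison table above; None marks a value the table leaves blank.
METRICS = ("Bleu_4", "METEOR", "CIDEr", "SODA_c")
SCORES = {
    "Transformer-XL": (1.93, 10.03, 40.32, 5.21),
    "SGN": (1.75, 9.43, 40.33, None),
    "BMT": (1.99, 8.78, 39.12, 5.45),
    "MMT": (1.51, 8.62, 40.02, None),
    "MART": (1.93, 8.93, 41.32, 5.66),
    "PDVC": (2.07, 9.34, 42.05, 6.11),
    "MSTVC": (2.17, 9.03, 41.14, 6.05),
}


def best_per_metric(scores):
    """Return {metric: (model, value)} for the highest reported value of each metric."""
    best = {}
    for i, metric in enumerate(METRICS):
        reported = {m: v[i] for m, v in scores.items() if v[i] is not None}
        model = max(reported, key=reported.get)
        best[metric] = (model, reported[model])
    return best


def relative_change(scores, model, baseline):
    """Percentage change of `model` over `baseline` for each metric both report."""
    out = {}
    for i, metric in enumerate(METRICS):
        a, b = scores[model][i], scores[baseline][i]
        if a is not None and b is not None:
            out[metric] = round(100.0 * (a - b) / b, 2)
    return out


if __name__ == "__main__":
    for metric, (model, value) in best_per_metric(SCORES).items():
        print(f"{metric}: best is {model} ({value})")
    print(relative_change(SCORES, "MSTVC", "PDVC"))
```

On these rows, MSTVC has the highest Bleu_4, Transformer-XL the highest METEOR, and PDVC the highest CIDEr and SODA_c, so the table supports the BLEU4 claim specifically rather than a lead on every metric.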

Model	Bleu_4	METEOR	CIDEr
BMT	1.75	7.55	25.33
MMT	1.70	7.41	26.66
Transformer-XL	1.65	8.03	28.52
MART	1.78	7.68	28.11
SGN	1.80	8.08	29.59
MSTVC	1.87	8.01	29.04

Model	Bleu_4	METEOR	CIDEr	SODA_c
BMT	0.81	3.75	21.01	3.95
Transformer-XL	0.76	3.43	20.33	3.55
PDVC	0.87	4.54	22.78	4.42
MSTVC	0.92	4.25	21.65	4.22

Network	Parameters	Computation	METEOR
C3D	98.32	59.25	8.09
R(2+1)-18	33.15	30.55	6.25
R(2+1)-34	45.26	60.56	8.89
R(2+1)-152	100.88	91.54	8.91
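
The backbone table above trades model size against caption quality. As a rough sketch, the differences between successive R(2+1) depths can be computed directly from its rows. The helper name `step_cost` is illustrative, and because the table gives no units for the parameter and computation columns, none are assumed here.

```python
# Rows copied from the backbone table above: (parameters, computation, METEOR).
# The table gives no units for the first two columns, so none are assumed here.
BACKBONES = {
    "C3D": (98.32, 59.25, 8.09),
    "R(2+1)-18": (33.15, 30.55, 6.25),
    "R(2+1)-34": (45.26, 60.56, 8.89),
    "R(2+1)-152": (100.88, 91.54, 8.91),
}


def step_cost(smaller, larger):
    """Differences (parameters, computation, METEOR) when moving from `smaller` to `larger`."""
    return tuple(round(b - a, 2) for a, b in zip(BACKBONES[smaller], BACKBONES[larger]))


if __name__ == "__main__":
    for smaller, larger in [("R(2+1)-18", "R(2+1)-34"), ("R(2+1)-34", "R(2+1)-152")]:
        d_params, d_compute, d_meteor = step_cost(smaller, larger)
        print(f"{smaller} -> {larger}: +{d_params} parameters, "
              f"+{d_compute} computation, +{d_meteor} METEOR")
```

Read this way, moving from R(2+1)-18 to R(2+1)-34 adds 2.64 METEOR, while the further step to R(2+1)-152 adds only 0.02 METEOR for a much larger increase in parameters.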

Features	Recall	Precision	Bleu_4	METEOR	CIDEr	SODA_c
C3D	55.20	57.36	1.82	8.09	38.16	5.47
TSN	56.21	57.46	1.92	8.63	39.00	5.68
I3D	55.88	57.97	1.97	8.97	40.21	6.11
R(2+1)	55.76	57.88	1.99	8.89	41.01	6.08
A-R(2+1)	55.79	57.39	2.09	9.03	41.14	6.05