Journal of System Simulation ›› 2024, Vol. 36 ›› Issue (5): 1061-1071. doi: 10.16182/j.issn1004731x.joss.23-0017
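Received: 2023-01-04
Revised: 2023-03-24
Online: 2024-05-15
Published: 2024-05-21
Corresponding author: Sang Haifeng
E-mail: lixiang3278@163.com; sanghaif@163.com
About the first author: Li Xiang (1999-), female, master's student; research interest: video description. E-mail: lixiang3278@163.com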
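Abstract:
Most current dense video description models use a two-stage approach, which is inefficient, ignores audio and semantic information, and produces incomplete descriptions. To address these problems, a dense video description method based on multi-modal and semantic information fusion in a Transformer network is proposed. An adaptive R(2+1)D network extracts visual features, a semantic detector is designed to generate semantic information, and audio features are added as a supplement. A multi-scale deformable attention module is built and parallel prediction heads are applied, which speeds up model convergence and improves model accuracy. Experimental results show that the model performs well on two benchmark datasets, reaching 2.17 on the BLEU4 metric.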
Cite this article:
Li Xiang, Sang Haifeng. Dense Video Description Method Based on Multi-modal Fusion in Transformer Network[J]. Journal of System Simulation, 2024, 36(5): 1061-1071.