[1] Venugopalan S, Rohrbach M, Donahue J, et al. Sequence to Sequence-Video to Text[C]//2015 IEEE International Conference on Computer Vision (ICCV). Piscataway, NJ, USA: IEEE, 2015: 4534-4542.
[2] Krishna R, Hata K, Ren F, et al. Dense-captioning Events in Videos[C]//2017 IEEE International Conference on Computer Vision (ICCV). Piscataway, NJ, USA: IEEE, 2017: 706-715.
[3] Duan Xuguang, Huang Wenbing, Gan Chuang, et al. Weakly Supervised Dense Event Captioning in Videos[EB/OL]. (2018-12-10) [2022-07-12].
[4] Jiao Yifan, Li Zhetao, Huang Shucheng, et al. Three-dimensional Attention-based Deep Ranking Model for Video Highlight Detection[J]. IEEE Transactions on Multimedia, 2018, 20(10): 2693-2705.
[5] Ning Ke, Cai Ming, Xie Di, et al. An Attentive Sequence to Sequence Translator for Localizing Video Clips by Natural Language[J]. IEEE Transactions on Multimedia, 2020, 22(9): 2434-2443.
[6] Vaswani A, Shazeer N, Parmar N, et al. Attention is All You Need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates Inc., 2017: 6000-6010.
[7] Yu Zhou, Han Nanjia. Accelerated Masked Transformer for Dense Video Captioning[J]. Neurocomputing, 2021, 445: 72-80.
[8] Iashin Vladimir, Rahtu Esa. A Better Use of Audio-visual Cues: Dense Video Captioning with Bi-modal Transformer[C]//The 31st British Machine Vision Conference. Durham: BMVC, 2020: 111.
[9] Chang Zhi, Zhao Dexin, Chen Huilin, et al. Event-centric Multi-modal Fusion Method for Dense Video Captioning[J]. Neural Networks, 2022, 146: 120-129.
[10] Xu Yuecong, Yang Jianfei, Mao Kezhi. Semantic-filtered Soft-split-aware Video Captioning with Audio-augmented Feature[J]. Neurocomputing, 2019, 357: 24-35.
[11] Wu Chunlei, Wei Yiwei, Chu Xiaoliang, et al. Hierarchical Attention-based Multimodal Fusion for Video Captioning[J]. Neurocomputing, 2018, 315: 362-370.
[12] Lee Sujin, Kim Incheol. Learning Semantic Features for Dense Video Captioning[J]. Journal of KIISE, 2019, 46(8): 753-762.
[13] Wang Teng, Zheng Huicheng, Yu Mingjing, et al. Event-centric Hierarchical Representation for Dense Video Captioning[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(5): 1890-1900.
[14] Zhang Zhiwang, Xu Dong, Ouyang Wanli, et al. Show, Tell and Summarize: Dense Video Captioning Using Visual Cue Aided Sentence Summarization[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2020, 30(9): 3130-3139.
[15] Wang Teng, Zhang Ruimao, Lu Zhichao, et al. End-to-end Dense Video Captioning with Parallel Decoding[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway, NJ, USA: IEEE, 2021: 6827-6837.
[16] Banerjee S, Lavie A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments[C]//Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Stroudsburg, PA, USA: ACL, 2005: 65-72.
[17] Vedantam R, Zitnick C L, Parikh D. CIDEr: Consensus-based Image Description Evaluation[C]//2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ, USA: IEEE, 2015: 4566-4575.
[18] Fujita Soichiro, Hirao Tsutomu, Kamigaito Hidetaka, et al. SODA: Story Oriented Dense Video Captioning Evaluation Framework[C]//Computer Vision-ECCV 2020. Cham: Springer International Publishing, 2020: 517-531.
[19] Dai Zihang, Yang Zhilin, Yang Yiming, et al. Transformer-XL: Attentive Language Models Beyond a Fixed-length Context[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA, USA: ACL, 2019: 2978-2988.
[20] Ryu Hobin, Kang Sunghun, Kang Haeyong, et al. Semantic Grouping Network for Video Captioning[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(3): 2514-2522.
[21] Gabeur Valentin, Sun Chen, Alahari Karteek, et al. Multi-modal Transformer for Video Retrieval[C]//Computer Vision-ECCV 2020. Cham: Springer International Publishing, 2020: 214-229.
[22] Lei Jie, Wang Liwei, Shen Yelong, et al. MART: Memory-augmented Recurrent Transformer for Coherent Video Paragraph Captioning[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA, USA: ACL, 2020: 2603-2614.