A Temporal Action Detection Algorithm Based on Spatio-Temporal Feature Pyramid Network

doi:10.16182/j.issn1004731x.joss.19-FZ0369

Journal of System Simulation ›› 2019, Vol. 31 ›› Issue (11): 2382-2387. doi: 10.16182/j.issn1004731x.joss.19-FZ0369

Liu Wang, Sun Jinyu, Ma Shiwei^*

School of Mechatronic Engineering and Automation, Shanghai University, Shanghai 200444, China

Received: 2019-05-21 Revised: 2019-07-23 Online: 2019-11-10 Published: 2019-12-13
About the authors: Liu Wang (b. 1995), male, from Fuzhou, Fujian; master's student; research interest: video retrieval. Ma Shiwei (corresponding author, b. 1965), male, from Jiayuguan, Gansu; Ph.D., professor; research interests: signal processing, image processing, and pattern recognition.
Funding:
Sub-project of a Major Project of the Xinjiang Production and Construction Corps (2018AA008-04)

Abstract

To address the discontinuity of temporal action proposals produced by frame-level prediction networks, a temporal action detection algorithm based on a spatio-temporal feature pyramid network (ST-FPN) is proposed. For frame-level action prediction, several 3D convolutional-de-convolutional (CDC) networks downsample the spatial feature dimensions to 1 and upsample the temporal feature dimension back to the corresponding proposal length, yielding multiple predictions at different temporal scales. The predictions of the sub-networks are fused by non-maximum suppression (NMS), and a softmax classifier performs frame-level action classification, from which the temporal proposals are obtained. Experimental results on the public THUMOS14 dataset show that the proposed algorithm improves the accuracy of temporal action localization.

Key words: temporal action detection, feature fusion, spatio-temporal feature pyramid network, 3D convolution-de-convolution, non-maximum suppression
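The abstract does not spell out how the multi-scale predictions are fused, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes each sub-network has already produced per-frame class scores of equal length, that class 0 is background, and that an IoU threshold of 0.5 is used for suppression. The names `frames_to_segments`, `temporal_iou`, and `temporal_nms` are hypothetical.

```python
from typing import List, Optional, Tuple

# (start_frame, end_frame_exclusive, class_id, mean_score)
Segment = Tuple[int, int, int, float]


def frames_to_segments(frame_scores: List[List[float]],
                       background: int = 0) -> List[Segment]:
    """Group consecutive frames with the same arg-max class into segments."""
    segments: List[Segment] = []
    start, current, total = 0, None, 0.0
    rows: List[Optional[List[float]]] = list(frame_scores) + [None]  # sentinel flushes last run
    for t, scores in enumerate(rows):
        label = None if scores is None else max(range(len(scores)),
                                                key=scores.__getitem__)
        if label != current:
            # Close the previous run unless it was background or empty.
            if current is not None and current != background:
                segments.append((start, t, current, total / (t - start)))
            start, current, total = t, label, 0.0
        if scores is not None:
            total += scores[label]
    return segments


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two frame intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(segments: List[Segment],
                 iou_threshold: float = 0.5) -> List[Segment]:
    """Keep the highest-scoring segment among same-class overlapping ones."""
    kept: List[Segment] = []
    for seg in sorted(segments, key=lambda s: s[3], reverse=True):
        if all(seg[2] != k[2] or temporal_iou(seg, k) <= iou_threshold
               for k in kept):
            kept.append(seg)
    return kept


if __name__ == "__main__":
    # Illustrative scores for two temporal scales (rows: frames, columns: classes).
    scale_a = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7], [0.6, 0.4]]
    scale_b = [[0.4, 0.6], [0.1, 0.9], [0.2, 0.8], [0.7, 0.3]]
    candidates = frames_to_segments(scale_a) + frames_to_segments(scale_b)
    print(temporal_nms(candidates))
```

In this toy input the two scales propose overlapping segments of the same class, and the suppression step keeps only the one with the higher mean score. A real system would operate on network outputs rather than hand-written scores.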

Chinese Library Classification (CLC) number: TP391.4

Liu Wang, Sun Jinyu, Ma Shiwei. A Temporal Action Detection Algorithm Based on Spatio-Temporal Feature Pyramid Network[J]. Journal of System Simulation, 2019, 31(11): 2382-2387.

