Action Recognition Model of Directed Attention Based on Cosine Similarity

doi:10.16182/j.issn1004731x.joss.22-0937

Abstract:

To address the lack of directionality in traditional dot-product attention, this paper proposes a directed attention model (DAM) based on cosine similarity. To represent the directional relationships between the spatio-temporal features of video frames, the relation function of the attention mechanism is defined using cosine similarity, which removes the magnitude (absolute value) of the relationship between features. To reduce the computational cost of the attention mechanism, the operation is decomposed along the temporal and spatial dimensions, and the computational complexity is reduced further by combining it with linear attention. The experiments have two stages. First, four ablation experiments are carried out on the modules of directed attention to identify the DAM configuration with the best accuracy and efficiency. Second, the model is evaluated on benchmark datasets: on the Sth-Sth V1 (Something-Something V1) dataset its accuracy is 7.3 percentage points higher than that of I3D-NL (inflated 3D ConvNet with non-local blocks), and on the UCF101 (101 human action classes from videos in the wild) dataset it reaches a recognition accuracy of 95.7%. The results have broad application prospects in areas such as security monitoring and autonomous driving.
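
The ideas in the abstract can be illustrated with a minimal pure-Python sketch. It is not the paper's implementation: the 1/N normalization, the function names, and the space-then-time split are assumptions made for illustration only. It shows (a) a cosine-similarity relation function that ignores feature magnitude, (b) how such an attention, having no softmax, can be reordered into a linear-complexity form, and (c) how many pairwise relations a spatial/temporal split would save on a 4×14×14 feature map (the Res₃ output size listed in the ResNet50-C2D table).

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to (approximately) unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]


def cosine_relation(q, k):
    """Cosine similarity: depends on the angle between q and k, not their norms."""
    return sum(a * b for a, b in zip(l2_normalize(q), l2_normalize(k)))


def attention_pairwise(queries, keys, values):
    """y_i = (1/N) * sum_j cos(q_i, k_j) * v_j, built from all N*N relations."""
    n, dv = len(keys), len(values[0])
    out = []
    for q in queries:
        w = [cosine_relation(q, k) for k in keys]
        out.append([sum(w[j] * values[j][c] for j in range(n)) / n
                    for c in range(dv)])
    return out


def attention_linear(queries, keys, values):
    """Same output, reordered as Q_hat (K_hat^T V) / N; no N*N matrix is formed."""
    n, dv = len(keys), len(values[0])
    q_hat = [l2_normalize(q) for q in queries]
    k_hat = [l2_normalize(k) for k in keys]
    d = len(k_hat[0])
    # context[a][c] = sum_j k_hat[j][a] * values[j][c]  (a d x dv matrix)
    context = [[sum(k_hat[j][a] * values[j][c] for j in range(n))
                for c in range(dv)] for a in range(d)]
    return [[sum(q[a] * context[a][c] for a in range(d)) / n
             for c in range(dv)] for q in q_hat]


def relation_counts(t, h, w):
    """Pairwise relations for joint attention vs. an assumed space-then-time split."""
    joint = (t * h * w) ** 2
    split = t * (h * w) ** 2 + (h * w) * t ** 2
    return joint, split


# Toy inputs (hypothetical values, chosen only to exercise the functions).
queries = [[1.0, 0.0], [0.0, 2.0]]
keys = [[3.0, 0.0], [0.0, -1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

slow = attention_pairwise(queries, keys, values)
fast = attention_linear(queries, keys, values)
print(slow)
print(fast)
print(relation_counts(4, 14, 14))  # (614656, 156800)
```

Because the relation is a plain inner product of unit vectors, the two attention functions agree up to floating-point rounding, and the relation can be negative, which is one way to read "direction" here. Under the assumed split, the 4×14×14 map needs roughly a quarter of the pairwise relations that joint attention does.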

Key words: action recognition, deep learning, attention mechanism, cosine similarity, spatio-temporal decomposition

CLC number: TP391.4

Li Chen, He Ming, Dong Chen, et al. Action Recognition Model of Directed Attention Based on Cosine Similarity[J]. Journal of System Simulation, 2024, 36(1): 67-82.

Figures and tables (14)

Table 2 ResNet50-C2D architecture

Name	Output size	Kernel	Stride
Input	32×224×224
Conv₁	16×112×112	1×7×7, 64	1, 2, 2
Pool₁	8×56×56	3×3×3, max	1, 2, 2
Res₁	8×56×56	[1×1×1, 64; 1×3×3, 64; 1×1×1, 256] ×3	1, 1, 1
Pool₂	4×56×56	3×1×1, max	2, 1, 1
Res₂	4×28×28	[1×1×1, 128; 1×3×3, 128; 1×1×1, 512] ×4	1, 2, 2
Res₃	4×14×14	[1×1×1, 256; 1×3×3, 256; 1×1×1, 1024] ×6	1, 2, 2
Res₄	4×7×7	[1×1×1, 512; 1×3×3, 512; 1×1×1, 2048] ×3	1, 2, 2
Pool₃	1×1×1	4×7×7, average	1, 1, 1
Fc	1×1×1	2048×class	1, 1, 1

References (41)

1	Wang Xiaolong, Girshick R, Gupta A, et al. Non-local Neural Networks[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE, 2018: 7794-7803.
2	Truong T D, Bui Quoc-Huy, Chi Nhan Duong, et al. DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ, USA: IEEE, 2022: 19998-20008.
3	Babiloni F, Marras I, Kokkinos F, et al. Poly-NL: Linear Complexity Non-local Layers with 3rd Order Polynomials[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway, NJ, USA: IEEE, 2021: 10498-10508.
4	Wang Heng, Schmid Cordelia. Action Recognition with Improved Trajectories[C]//2013 IEEE International Conference on Computer Vision. Piscataway, NJ, USA: IEEE, 2013: 3551-3558.
5	Simonyan K, Zisserman A. Two-stream Convolutional Networks for Action Recognition in Videos[C]//Proceedings of the 27th International Conference on Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2014: 568-576.
6	Feichtenhofer Christoph, Pinz Axel, Zisserman A. Convolutional Two-stream Network Fusion for Video Action Recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ, USA: IEEE, 2016: 1933-1941.
7	Feichtenhofer Christoph, Pinz Axel, Wildes Richard P. Spatiotemporal Residual Networks for Video Action Recognition[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates Inc., 2016: 3476-3484.
8	Feichtenhofer Christoph, Pinz Axel, Wildes Richard P. Spatiotemporal Multiplier Networks for Video Action Recognition[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ, USA: IEEE, 2017: 7445-7454.
9	Tran D, Bourdev L, Fergus R, et al. Learning Spatiotemporal Features with 3D Convolutional Networks[C]//2015 IEEE International Conference on Computer Vision (ICCV). Piscataway, NJ, USA: IEEE, 2015: 4489-4497.
10	Carreira João, Zisserman A. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ, USA: IEEE, 2017: 4724-4733.
11	Lin Ji, Gan Chuang, Han Song. TSM: Temporal Shift Module for Efficient Video Understanding[C]//2019 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway, NJ, USA: IEEE, 2019: 7082-7092.
12	Zhou Bolei, Andonian A, Oliva A, et al. Temporal Relational Reasoning in Videos[C]//Computer Vision-ECCV 2018. Cham: Springer International Publishing, 2018: 831-846.
13	Wang Limin, Xiong Yuanjun, Wang Zhe, et al. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition[C]//Computer Vision-ECCV 2016. Cham: Springer International Publishing, 2016: 20-36.
14	Qiu Zhaofan, Yao Ting, Mei Tao. Learning Spatio-temporal Representation with Pseudo-3D Residual Networks[C]//2017 IEEE International Conference on Computer Vision (ICCV). Piscataway, NJ, USA: IEEE, 2017: 5534-5542.
15	Xie Saining, Sun Chen, Huang J, et al. Rethinking Spatiotemporal Feature Learning: Speed-accuracy Trade-offs in Video Classification[C]//Computer Vision-ECCV 2018. Cham: Springer International Publishing, 2018: 318-335.
16	Tran D, Wang Heng, Feiszli M, et al. Video Classification with Channel-separated Convolutional Networks[C]//2019 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway, NJ, USA: IEEE, 2019: 5551-5560.
17	Zolfaghari Mohammadreza, Singh Kamaljeet, Brox Thomas. ECO: Efficient Convolutional Network for Online Video Understanding[C]//Computer Vision-ECCV 2018. Cham: Springer International Publishing, 2018: 713-730.
18	Feichtenhofer C. X3D: Expanding Architectures for Efficient Video Recognition[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ, USA: IEEE, 2020: 200-210.
19	Wang Xianyuan, Miao Zhenjiang, Zhang Ruyi, et al. I3D-LSTM: A New Model for Human Action Recognition[C]//IOP Conference Series: Materials Science and Engineering. Bristol, United Kingdom: IOP Publishing, 2019: 032035.
20	Tran D, Ray J, Shou Z, et al. Convnet Architecture Search for Spatiotemporal Feature Learning[J]. Computing Research Repository, 2017, 16(8): 1-12.
21	Li Yan, Ji Bin, Shi Xintian, et al. TEA: Temporal Excitation and Aggregation for Action Recognition[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ, USA: IEEE, 2020: 906-915.
22	Luo Huilan, Chen Han. Spatial-temporal Convolutional Attention Network for Action Recognition[J]. Computer Engineering and Applications, 2023, 59(9): 150-158.
23	Wu Lijun, Li Binbin, Chen Zhicong, et al. Action Recognition Under 3D Multiple Attention Mechanism[J]. Journal of Fuzhou University(Natural Science Edition), 2022, 50(1): 47-53.
24	Zhu Yi, Lan Zhenzhong, Newsam S, et al. Hidden Two-stream Convolutional Networks for Action Recognition[C]//Computer Vision-ACCV 2018. Cham: Springer International Publishing, 2019: 363-378.
25	Feichtenhofer C, Fan Haoqi, Malik J, et al. SlowFast Networks for Video Recognition[C]//2019 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway, NJ, USA: IEEE, 2019: 6201-6210.
26	Wang Limin, Tong Zhan, Ji Bin, et al. TDN: Temporal Difference Networks for Efficient Action Recognition[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ, USA: IEEE, 2021: 1895-1904.
27	Ma C Y, Chen M H, Kira Z, et al. TS-LSTM and Temporal-inception: Exploiting Spatiotemporal Dynamics for Activity Recognition[J]. Signal Processing: Image Communication, 2019, 71: 76-87.
28	Neimark D, Bar O, Zohar M, et al. Video Transformer Network[C]//2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW). Piscataway, NJ, USA: IEEE, 2021: 3156-3165.
29	Li Kunchang, Wang Yali, Gao Peng, et al. UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning[EB/OL]. (2022-02-08) [2022-04-04]. .
30	Arnab A, Dehghani M, Heigold G, et al. ViViT: A Video Vision Transformer[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway, NJ, USA: IEEE, 2021: 6816-6826.
31	Fan Haoqi, Xiong Bo, Mangalam K, et al. Multiscale Vision Transformers[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway, NJ, USA: IEEE, 2021: 6804-6815.
32	Li Yanghao, Wu Chaoyuan, Fan Haoqi, et al. MViTv2: Improved Multiscale Vision Transformers for Classification and Detection[EB/OL]. (2022-03-30) [2022-04-04]. .
33	Alfasly S, Lu Jian, Xu Chen, et al. Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-specific Annotated Videos[EB/OL]. (2022-03-27) [2022-04-04]. .
34	He Kaiming, Zhang Xiangyu, Ren Shaoqing, et al. Deep Residual Learning for Image Recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ, USA: IEEE, 2016: 770-778.
35	Kay W, Carreira João, Simonyan K, et al. The Kinetics Human Action Video Dataset[EB/OL]. (2017-05-19) [2022-04-04]. .
36	Goyal Raghav, Samira Ebrahimi Kahou, Michalski Vincent, et al. The "Something Something" Video Database for Learning and Evaluating Visual Common Sense[C]//2017 IEEE International Conference on Computer Vision (ICCV). Piscataway, NJ, USA: IEEE, 2017: 5843-5851.
37	Soomro K, Zamir A R, Shah M. UCF101: A Dataset of 101 Human Actions Classes from Videos in the Wild[EB/OL]. (2013-12-03) [2022-04-04]. .
38	Kuehne H, Jhuang H, Garrote E, et al. HMDB: A Large Video Database for Human Motion Recognition[C]//2011 International Conference on Computer Vision. Piscataway, NJ, USA: IEEE, 2011: 2556-2563.
39	Deng Jia, Dong Wei, Socher R, et al. ImageNet: A Large-scale Hierarchical Image Database[C]//2009 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE, 2009: 248-255.
40	Wang Xiaolong, Gupta A. Videos as Space-time Region Graphs[C]//Computer Vision-ECCV 2018: 15th European Conference. Heidelberg: Springer-Verlag, 2018: 413-431.
41	Zhou Bolei, Khosla A, Lapedriza A, et al. Learning Deep Features for Discriminative Localization[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ, USA: IEEE, 2016: 2921-2929.

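Aspect	Method	Representative models	Strengths and weaknesses
Temporal feature extraction	Two-stream network	Two-stream Network^[5-8]	Extracts temporal features, but is computationally expensive and unstable, and the features are local
	3D CNN	C3D^[9], I3D^[10]	Extracts spatio-temporal features, but is computationally expensive and cannot extract global features
	Temporal module	TSM^[11], TRN^[12], NLNN^[1]	Extracts spatio-temporal features flexibly, but cannot relate global information efficiently
Efficiency optimization	Input data optimization	TSN^[13]	Reduces computational cost, but lowers recognition accuracy
	Spatio-temporal factorized convolution	P3D^[14], S3D^[15]	Reduces computational cost, but is not conducive to optimal model iteration
	Depthwise separable convolution	CSN^[16]	Reduces computational cost, but lacks cross-channel information
	Hybrid convolution	ECO^[17], X3D^[18]	Reduces computational cost, but early-stage training is difficult
Global feature capture	Global uniform sampling	TSN^[13]	Captures global temporal features, but not global spatial features
	LSTM	I3D-LSTM^[19]	Strengthens global representation, but trains inefficiently
	Self-attention	NLNN^[1], DirecFormer^[2]	Extracts global features, but lacks feature directionality and runtime efficiency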

Module	Params/MB	GFLOPS×View	Top1/%	Top5/%
Baseline	24.18	26.19×10×3	71.76	89.85
DP-NL^[1]	25.56	26.47×10×3	72.68	90.47
Poly-NL^[3]	25.33	26.20×10×3	72.65	90.46
DA-NL-1	25.53	27.02×10×3	73.21	90.73
DA-NL-2	25.57	27.04×10×3	73.18	90.71

Position	Module	Params/MB	GFLOPS×View	Top1/%	Top5/%
	Baseline	24.18	26.19×10×3	71.76	89.85
Res₁	DA-NL	24.27	237.92×10×3	72.94	90.43
	DA-DA	25.19	52.72×10×3	72.72	90.35
	DA-Poly	24.83	26.26×10×3	72.43	90.27
	DA-DP	25.22	35.08×10×3	72.46	90.26
	Poly-DA	24.83	52.66×10×3	72.68	90.32
	DP-DA	25.22	52.68×10×3	72.65	90.31
Res₂	DA-NL	24.51	32.81×10×3	73.07	90.58
	DA-DA	26.23	27.85×10×3	72.94	90.41
	DA-Poly	26.08	26.20×10×3	72.68	90.32
	DA-DP	26.25	26.75×10×3	72.65	90.31
	Poly-DA	26.08	27.84×10×3	72.90	90.38
	DP-DA	26.25	27.84×10×3	72.91	90.37
Res₃	DA-NL	25.53	27.02×10×3	73.21	90.73
	DA-DA	27.98	26.40×10×3	72.84	90.53
	DA-Poly	27.35	26.19×10×3	72.53	90.39
	DA-DP	30.02	26.26×10×3	72.49	90.37
	Poly-DA	27.35	26.39×10×3	72.76	90.40
	DP-DA	30.02	26.39×10×3	72.74	90.39
Res₄	DA-NL	29.68	26.29×10×3	72.51	90.25
	DA-DA	36.18	26.22×10×3	72.34	90.15
	DA-Poly	35.82	26.19×10×3	72.01	89.98
	DA-DP	36.24	26.20×10×3	72.04	90.02
	Poly-DA	35.82	26.22×10×3	72.28	90.12
	DP-DA	36.24	26.22×10×3	72.31	90.10

Model	Backbone	Pre-training	Initial learning rate	Dropout	Momentum	Weight decay	Frames	Resolution	Epochs	Batch
C2D^[1]	ResNet50	I+K400	0.001	0.5	0.9	5×10^-4	32	224×224	100	16
TSN^[13]	BNInception	I	0.020	0.8	0.9	1×10^-4	8	224×224	50	8
TSN^[13]	ResNet50	I	0.020	0.8	0.9	1×10^-4	8	224×224	50	8
C3D^[1]	ResNet50	I+K400	0.001	0.5	0.9	5×10^-4	32	224×224	100	16
ECO^[17]	BNIncep+R18	I+K400	0.001	0.5	0.9	5×10^-4	8	224×224	100	32
I3D^[40]	ResNet50	I+K400	0.00125	0.3	0.9	1×10^-4	32	224×224	100	8
I3D+NL^[40]	ResNet50	I+K400	0.00125	0.3	0.9	1×10^-4	32	224×224	100	8
S3D-G^[15]	BN-Inception	I	0.100	0.5	0.9	1×10^-4	64	224×224	100	6
TRN^[12]	BNInception	I	0.001	0.5	0.9	1×10^-4	8	224×224	100	10
TRN^[12]	ResNet50	I	0.001	0.5	0.9	1×10^-4	8	224×224	100	10
TSM^[11]	ResNet50	I+K400	0.010	0.5	0.9	1×10^-4	8	224×224	50	64
TEA*^[21]	ResNet50	I					16	224×224
TDN^[26]	ResNet50	I+S1M	0.020	0.5	0.9	1×10^-4	8	224×224	60	128
DAM	ResNet50	I+K400	0.001	0.5	0.9	5×10^-4	32	224×224	100	16

Model	Backbone	Frames×View	Resolution	Params/MB	GFLOPS×View	Top1	Top5
C2D^[1]	ResNet50	32×1×1	224×224	24.18	26.19×1×1	18.7	45.3
TSN^[13]	BNInception	8×1×1	224×224	10.70	16×1×1	19.5
TSN^[13]	ResNet50	8×1×1	224×224	24.30	33×1×1	19.7	46.6
C3D^[1]	ResNet50	32×1×1	224×224	28.51	35.67×1×1	32.8	60.3
ECO^[17]	BNIncep+R18	8×1×1	224×224	47.50	32×1×1	39.6
I3D^[40]	ResNet50	32×2×1	224×224	28.00	153×2×1	41.6	72.2
I3D+NL^[40]	ResNet50	32×2×1	224×224	35.30	168×2×1	44.4	76.0
S3D-G^[15]	BN-Inception	64×1×1	224×224	11.56	71.38×1×1	48.2	78.7
TRN^[12]	BNInception	8×1×1	224×224	18.30	16×1×1	34.4
TRN^[12]	ResNet50	8×1×1	224×224	31.80	33×1×1	38.9	68.1
TSM^[11]	ResNet50	8×1×1	224×224	24.30	33×1×1	45.6	74.2
TEA^[21]	ResNet50	16×1×1	224×224		70×1×1	51.9	80.3
TDN^[26]	ResNet50	8×1×1	224×224		36×1×1	52.3	80.6
DAM	ResNet50	8×1×1	224×224	40.38	9.09×1×1	48.2	76.5
DAM	ResNet50	16×1×1	224×224	40.38	18.18×1×1	49.5	78.3
DAM	ResNet50	32×1×1	224×224	40.38	36.36×1×1	51.7	80.1
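
The GFLOPS×View entries in the tables above can be multiplied out to compare total inference cost. The sketch below assumes the common convention that the three factors are cost per view, number of clips, and number of crops. The source does not state this, and the helper name `total_gflops` is hypothetical.

```python
def total_gflops(cell):
    """Multiply out a cell such as '26.19×10×3' (assumed: per-view cost × clips × crops)."""
    total = 1.0
    for factor in cell.split("×"):
        total *= float(factor)
    return total


# Values copied from the ablation table: baseline vs. DA-NL-1.
baseline = total_gflops("26.19×10×3")
da_nl = total_gflops("27.02×10×3")
overhead = (da_nl - baseline) / baseline
print(f"baseline: {baseline:.1f} GFLOPs, DA-NL-1: {da_nl:.1f} GFLOPs, "
      f"overhead: {overhead:.1%}")
```

Read this way, the DA-NL-1 module adds about 3% to the baseline cost. The DAM rows in the last table scale linearly with the number of frames (9.09, 18.18, and 36.36 GFLOPs for 8, 16, and 32 frames).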