Journal of System Simulation ›› 2024, Vol. 36 ›› Issue (3): 555-563. doi: 10.16182/j.issn1004731x.joss.22-1234


  • First author: Su Benyue (1971-), male, professor, Ph.D.; research interests: machine learning and pattern recognition, graphics and image processing. E-mail: subenyue@sohu.com
  • Funding: Anhui Province Leading Talent Team Project (皖教秘人[2019]16号); Joint Graduate Research Innovation Fund of Anqing Normal University and Tongling University (tlaqsflhy2)

Human Action Recognition Based on Skeleton Edge Information Under Projection Subspace

Su Benyue1,2(), Zhang Peng1,2, Zhu Bangguo1,2, Guo Mengjuan1,2, Sheng Min3   

  1. School of Computer and Information, Anqing Normal University, Anqing 246133, China
    2. School of Mathematics and Computer, Tongling University, Tongling 244061, China
    3. School of Mathematics and Physics, Anqing Normal University, Anqing 246133, China
  • Received:2022-10-17 Revised:2023-01-31 Online:2024-03-15 Published:2024-03-14


Abstract:

In recent years, human action recognition based on skeleton data has attracted extensive attention in computer vision and human-computer interaction. Most existing methods model skeleton points in the original 3D coordinate space. However, skeleton points ignore the physical chain structure of the human body, which makes it difficult to capture the local correlations of human motion. In addition, owing to the diversity of camera views, it is hard to explore a comprehensive representation of actions across different views in the original point-based 3D space. In view of this, this paper proposes an action recognition method based on skeleton edge information in projection subspaces. The method defines skeleton edge information that incorporates the body's own connectivity to capture the spatial characteristics of an action; on top of the edge information, it introduces the direction and magnitude of skeleton edge motion to capture the temporal characteristics; it represents actions under different subspace views via 2D projection subspaces; and it explores a suitable feature fusion strategy, with the above features extracted comprehensively through an improved CNN framework. Experimental results on two challenging large-scale datasets, NTU-RGB+D 60 (cross-subject and cross-view protocols) and NTU-RGB+D 120 (cross-subject and cross-set protocols), show that, compared with the baseline method, the proposed method improves accuracy under the four protocols by 3.2%, 2.4%, 3.1%, and 5.8%, respectively.

Key words: skeleton data, skeleton edges, edge direction, edge magnitude, projection subspace
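As a rough illustration of the pipeline summarized in the abstract (not the authors' code), the sketch below shows with NumPy how skeleton edges, edge-motion direction and magnitude, and 2D projection subspaces might be computed; the toy skeleton, array shapes, and all names are hypothetical:

```python
import numpy as np

# Toy skeleton: 4 joints linked in a chain by 3 (parent, child) edges.
EDGES = [(0, 1), (1, 2), (2, 3)]

def edge_features(joints):
    """joints: (T, J, 3) array of 3D joint positions over T frames."""
    # Spatial feature: edge vectors = child joint minus parent joint, per frame.
    edges = np.stack([joints[:, c] - joints[:, p] for p, c in EDGES], axis=1)  # (T, E, 3)
    # Temporal feature: frame-to-frame edge motion, split into magnitude and direction.
    motion = np.diff(edges, axis=0)                         # (T-1, E, 3)
    magnitude = np.linalg.norm(motion, axis=-1)             # (T-1, E)
    direction = motion / (magnitude[..., None] + 1e-8)      # unit motion vectors
    # 2D projection subspaces: drop one coordinate axis at a time (xy, xz, yz).
    projections = (edges[..., :2], edges[..., [0, 2]], edges[..., 1:])
    return edges, magnitude, direction, projections

T, J = 5, 4
joints = np.random.rand(T, J, 3)
edges, mag, dirn, projs = edge_features(joints)
```

In the paper, features like these are fused and fed to an improved CNN; the fusion strategy and network are beyond the scope of this sketch.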

CLC number: