Global-local Fusion for Efficient 3D Object Detection

doi:10.16182/j.issn1004731x.joss.23-0926

Abstract

Point-cloud-based 3D object detection suffers from insufficient feature extraction and from inconsistency between classification and regression. To address both problems, this paper proposes ResCST, a novel architecture built on the SECOND network. Residual connections are added to the 3D sparse convolutional layers, and a hybrid CNN-Swin Transformer model is proposed for enhanced feature extraction, combining the Swin Transformer's ability to capture long-range dependencies with the convolutional neural network's ability to extract local features. An RCIoU method is introduced to jointly optimize the classification and regression tasks. Experiments on the KITTI dataset show that the model achieves 3D detection accuracies of 91.21%, 82.97%, and 80.28% for cars at the easy, moderate, and hard levels, respectively. The proposed method significantly improves the detection of hard-level targets while running at an inference speed of 25 frames per second, achieving a good balance between accuracy and efficiency.

Key words: 3D object detection, point cloud, feature fusion, attention mechanism, vehicle detection, voxelization, autonomous driving

CLC Number: TP391.9

Lu Bin, Wang Minghan, Sun Yang, Yang Zhenyu. Global-local Fusion for Efficient 3D Object Detection[J]. Journal of System Simulation, 2024, 36(11): 2616-2630.

Figures/Tables (12): Fig. 1-7, Table 1-5

References 45

1	Qi R, Su Hao, Mo Kaichun, et al. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2017: 77-85.
2	Qi R, Yi Li, Su Hao, et al. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates, Inc., 2017: 5105-5114.
3	Lang A H, Vora S, Caesar H, et al. PointPillars: Fast Encoders for Object Detection From Point Clouds[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2019: 12689-12697.
4	Zhou Sifan, Tian Zhi, Chu Xiangxiang, et al. FastPillars: A Deployment-friendly Pillar-based 3D Detector[EB/OL]. (2023-02-07) [2023-03-08].
5	Shi Guangsheng, Li Ruifeng, Ma Chao. PillarNet: Real-time and High-performance Pillar-based 3D Object Detection[EB/OL]. (2022-08-26) [2023-03-26].
6	Yin Tianwei, Zhou Xingyi, Krähenbühl Philipp. Center-based 3D Object Detection and Tracking[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2021: 11779-11788.
7	Zhou Yin, Tuzel O. VoxelNet: End-to-end Learning for Point Cloud Based 3D Object Detection[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 4490-4499.
8	Yan Yan, Mao Yuxing, Li Bo. SECOND: Sparsely Embedded Convolutional Detection[J]. Sensors, 2018, 18(10): 3337.
9	Vaswani A, Shazeer N, Parmar N, et al. Attention Is All You Need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates, Inc., 2017: 6000-6010.
10	Li Jiashi, Xia Xin, Li Wei, et al. Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios[EB/OL]. (2022-08-16) [2023-04-11].
11	Li Jiale, Luo Shujie, Zhu Ziqi, et al. 3D IoU-net: IoU Guided 3D Object Detector for Point Clouds[EB/OL]. (2020-04-10) [2023-04-05].
12	Ren Shaoqing, He Kaiming, Girshick R, et al. Faster R-CNN: Towards Real-time Object Detection with Region Proposal Networks[C]//Proceedings of the 28th International Conference on Neural Information Processing Systems. Cambridge: MIT Press, 2015: 91-99.
13	Li Xiang, Wang Wenhai, Wu Lijun, et al. Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection[C]//Proceedings of the 34th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates, Inc., 2020: 21002-21012.
14	Zheng Zhaohui, Wang Ping, Liu Wei, et al. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 12993-13000.
15	Zhou Dingfu, Fang Jin, Song Xibin, et al. IoU Loss for 2D/3D Object Detection[C]//2019 International Conference on 3D Vision (3DV). Piscataway: IEEE, 2019: 85-94.
16	Zheng Wu, Tang Weiliang, Jiang Li, et al. SE-SSD: Self-ensembling Single-stage Object Detector from Point Cloud[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2021: 14489-14498.
17	Sheng Hualian, Cai Sijia, Zhao Na, et al. Rethinking IoU-based Optimization for Single-stage 3D Object Detection[C]//Computer Vision – ECCV 2022. Cham: Springer Nature Switzerland, 2022: 544-561.
18	Shi Shaoshuai, Wang Xiaogang, Li Hongsheng. PointRCNN: 3D Object Proposal Generation and Detection from Point Cloud[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2019: 770-779.
19	Yang Zetong, Sun Yanan, Liu Shu, et al. 3DSSD: Point-based 3D Single Stage Object Detector[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2020: 11037-11045.
20	Pan Xuran, Xia Zhuofan, Song Shiji, et al. 3D Object Detection with Pointformer[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2021: 7459-7468.
21	Shi Weijing, Rajkumar R. Point-GNN: Graph Neural Network for 3D Object Detection in a Point Cloud[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2020: 1708-1716.
22	Ge Runzhou, Ding Zhuangzhuang, Hu Yihan, et al. AFDet: Anchor Free One Stage 3D Object Detection[EB/OL]. (2020-06-30) [2023-04-26].
23	Hu Yihan, Ding Zhuangzhuang, Ge Runzhou, et al. AFDetV2: Rethinking the Necessity of the Second Stage for Object Detection from Point Clouds[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2022, 36(1): 969-979.
24	Zheng Wu, Tang Weiliang, Chen Sijin, et al. CIA-SSD: Confident IoU-aware Single-stage Object Detector from Point Cloud[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(4): 3555-3562.
25	Fan Lue, Pang Ziqi, Zhang Tianyuan, et al. Embracing Single Stride 3D Object Detector with Sparse Transformer[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 8448-8458.
26	Zou Jiayu, Tian Kun, Zhu Zheng, et al. DiffBEV: Conditional Diffusion Model for Bird's Eye View Perception[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2024, 38(7): 7846-7854.
27	Li Bo, Zhang Tianlei, Xia Tian. Vehicle Detection from 3D Lidar Using Fully Convolutional Network[EB/OL]. (2016-08-29) [2023-05-31].
28	Beltrán Jorge, Guindel Carlos, Francisco Miguel Moreno, et al. BirdNet: A 3D Object Detection Framework from LiDAR Information[C]//2018 21st International Conference on Intelligent Transportation Systems (ITSC). Piscataway: IEEE, 2018: 3517-3523.
29	Wang Tai, Zhu Xinge, Lin Dahua. Reconfigurable Voxels: A New Representation for LiDAR-based Point Clouds[C]//Proceedings of the 2020 Conference on Robot Learning. Chia Laguna Resort: PMLR, 2021: 286-295.
30	Wu Hai, Wen Chenglu, Li Wei, et al. Transformation-equivariant 3D Object Detection for Autonomous Driving[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2023, 37(3): 2795-2802.
31	Wu Xiaopei, Peng Liang, Yang Honghui, et al. Sparse Fuse Dense: Towards High Quality 3D Detection with Depth Completion[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2023: 5408-5417.
32	Dosovitskiy A, Beyer L, Kolesnikov A, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale[EB/OL]. (2021-06-03) [2023-05-28].
33	Zhou Yin, Sun Pei, Zhang Yu, et al. End-to-end Multi-view Fusion for 3D Object Detection in LiDAR Point Clouds[C]//Proceedings of the Conference on Robot Learning. Chia Laguna Resort: PMLR, 2020: 923-932.
34	Liu Ze, Lin Yutong, Cao Yue, et al. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2021: 9992-10002.
35	Zhao Hengshuang, Jiang Li, Jia Jiaya, et al. Point Transformer[EB/OL]. (2021-09-26) [2023-05-13].
36	Mao Jiageng, Xue Yujing, Niu Minzhe, et al. Voxel Transformer for 3D Object Detection[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2021: 3144-3153.
37	He Chenhang, Li Ruihuang, Li Shuai, et al. Voxel Set Transformer: A Set-to-set Approach to 3D Object Detection from Point Clouds[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 8407-8417.
38	Chen Xuanyao, Liu Zhijian, Tang Haotian, et al. SparseViT: Revisiting Activation Sparsity for Efficient High-resolution Vision Transformer[C]//2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2023: 2061-2070.
39	He Kaiming, Zhang Xiangyu, Ren Shaoqing, et al. Deep Residual Learning for Image Recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2016: 770-778.
40	Xia Xin, Li Jiashi, Wu Jie, et al. TRT-ViT: TensorRT-oriented Vision Transformer[EB/OL]. (2022-07-12) [2023-06-01].
41	Li Yanyu, Yuan Geng, Wen Yang, et al. EfficientFormer: Vision Transformers at MobileNet Speed[C]//Advances in Neural Information Processing Systems. Red Hook: Curran Associates, Inc., 2022: 12934-12949.
42	Li Xiang, Wang Wenhai, Hu Xiaolin, et al. Generalized Focal Loss V2: Learning Reliable Localization Quality Estimation for Dense Object Detection[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2021: 11627-11636.
43	Geiger Andreas, Lenz Philip, Urtasun R. Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite[C]//Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2012: 3354-3361.
44	Sheng Hualian, Cai Sijia, Liu Yuan, et al. Improving 3D Object Detection with Channel-wise Transformer[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2021: 2723-2732.
45	Xu Qiangeng, Zhong Yiqi, Neumann U. Behind the Curtain: Learning Occluded Shapes for 3D Object Detection[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2022, 36(3): 2893-2901.

Environment	Configuration
GPU	NVIDIA RTX 3090
GPU memory	24 GB
Python version	3.8
Deep learning framework	PyTorch 1.8.1
CUDA version	11.1
cuDNN version	8.0
SpConv	2.3.3

Car AP_3D (IoU = 0.7), %
Method	Stage	Easy	Moderate	Hard
PointRCNN^[18]	Two	85.94	75.76	68.32
Point-GNN^[21]	Two	88.33	79.47	72.29
VoTr-TSD^[36]	Two	89.90	82.09	79.14
CT3D^[44]	Two	87.83	81.77	77.16
BtcDet^[45]	Two	90.64	82.86	78.09
VoxelNet^[7]	One	77.82	65.11	62.85
SECOND^[8]	One	83.34	73.66	66.20
PointPillars^[3]	One	86.46	77.28	74.65
SE-SSD^[16]	One	91.49	82.54	77.15
Ours	One	91.21	82.97	80.28

Cyclist AP_3D (IoU = 0.7), %
Method	Stage	Easy	Moderate	Hard
PointRCNN^[18]	Two	92.51	71.89	67.48
Point-GNN^[21]	Two	‒	‒	‒
VoTr-TSD^[36]	Two	‒	‒	‒
CT3D^[44]	Two	89.01	71.88	67.91
BtcDet^[45]	Two	‒	‒	‒
VoxelNet^[7]	One	‒	‒	‒
SECOND^[8]	One	82.96	66.74	62.78
PointPillars^[3]	One	81.58	62.94	58.98
SE-SSD^[16]	One	‒	‒	‒
Ours	One	87.81	69.94	65.33

Car AP_3D (IoU = 0.7), %
Method	FPS	Easy	Moderate	Hard
VoxelNet^[7]	4.4	81.97	65.46	62.85
SECOND^[8]	20	87.43	76.48	69.10
PointPillars^[3]	42	86.62	76.06	68.91
SE-SSD^[16]	25	‒	81.71	‒
Ours	25	89.17	78.95	78.04
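
To make the speed column easier to compare, the frame rates can be converted into an average per-frame latency. The short Python sketch below is illustrative only: the FPS values are copied from the table above, while the helper name `latency_ms` and the dictionary names are assumptions introduced here, and the conversion assumes FPS is a simple average over frames.

```python
# FPS values copied from the speed comparison table above.
FPS = {"VoxelNet": 4.4, "SECOND": 20, "PointPillars": 42, "SE-SSD": 25, "Ours": 25}


def latency_ms(fps):
    """Average per-frame latency in milliseconds implied by a frame rate."""
    return 1000.0 / fps


# Hypothetical summary structure: method name -> latency rounded to 0.1 ms.
LATENCY = {name: round(latency_ms(fps), 1) for name, fps in FPS.items()}

for name, ms in LATENCY.items():
    print(f"{name}: {ms} ms per frame")
```

Under this reading, 25 FPS corresponds to roughly 40 ms per frame, the same budget as SE-SSD and between SECOND (about 50 ms) and PointPillars (about 24 ms).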

Car AP_3D, %
ResSPConvNet	CSTNet	RCIoU	Easy	Moderate	Hard
×	×	×	87.43	76.48	69.10
√	×	×	87.90	77.61	76.48
×	√	×	87.77	77.42	76.03
×	×	√	88.66	78.59	77.72
√	√	√	89.87	79.10	78.99
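
The contribution of each component is easiest to read as a gain over the baseline row (all three modules disabled). The following Python sketch computes those differences from the ablation values above. It is a reading aid under stated assumptions: the numbers are copied from the table, and the names `ABLATION`, `gains_over_baseline`, and `GAINS` are hypothetical.

```python
# Ablation rows copied from the table above.
# Key: (ResSPConvNet, CSTNet, RCIoU) enabled flags.
# Value: Car AP_3D in % for (Easy, Moderate, Hard).
ABLATION = {
    (False, False, False): (87.43, 76.48, 69.10),
    (True, False, False): (87.90, 77.61, 76.48),
    (False, True, False): (87.77, 77.42, 76.03),
    (False, False, True): (88.66, 78.59, 77.72),
    (True, True, True): (89.87, 79.10, 78.99),
}

BASELINE_KEY = (False, False, False)


def gains_over_baseline(table):
    """Return the AP difference (percentage points) of each row vs. the baseline."""
    baseline = table[BASELINE_KEY]
    return {
        config: tuple(round(ap - base, 2) for ap, base in zip(scores, baseline))
        for config, scores in table.items()
        if config != BASELINE_KEY
    }


GAINS = gains_over_baseline(ABLATION)
for config, delta in GAINS.items():
    print(config, delta)
```

In every configuration the largest gain falls on the hard level, which is consistent with the abstract's claim that the method mainly improves hard-level detection.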

Car AP_3D, %
Convolution operator	Easy	Moderate	Hard	Parameter setting
Standard convolution	89.87	79.10	78.99	‒
Dilated convolution	89.26	78.97	78.26	dilation rate = 2
Grouped convolution	88.77	77.42	77.03	group size = 2