Journal of System Simulation ›› 2024, Vol. 36 ›› Issue (11): 2616-2630. doi: 10.16182/j.issn1004731x.joss.23-0926

• Research Paper •

Efficient 3D Object Detection with Global Information Perception and Local Feature Fusion

鲁斌1,2, 王明晗1,2, 孙洋1,2, 杨振宇1,2   

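  1. Department of Computer, North China Electric Power University, Baoding, Hebei 071003
  2. Key Laboratory of Energy and Electric Power Knowledge Calculation in Hebei Province, Baoding, Hebei 071003
  • Received: 2023-07-21 Revised: 2023-09-12 Online: 2024-11-13 Published: 2024-11-19
  • Corresponding author: Wang Minghan
  • About the first author: Lu Bin (b. 1975), male, professor, doctoral supervisor, Ph.D.; research interests include intelligent computing and computer vision, and integrated energy systems and big data analytics.
  • Funding:
    Hebei Province Key Research and Development Program (20310103D); Hebei Province Graduate Student Innovation Ability Training Funding Project (CXZZBS2023153)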

Global-local Fusion for Efficient 3D Object Detection

Lu Bin1,2, Wang Minghan1,2, Sun Yang1,2, Yang Zhenyu1,2   

  1. Department of Computer, North China Electric Power University, Baoding 071003, China
  2. Key Laboratory of Energy and Electric Power Knowledge Calculation in Hebei Province, Baoding 071003, China
  • Received: 2023-07-21 Revised: 2023-09-12 Online: 2024-11-13 Published: 2024-11-19
  • Contact: Wang Minghan

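Abstract (Chinese version):

Point-cloud-based 3D object detection suffers from limited feature extraction capability and from inconsistency between classification and regression in the detection head. To address both problems, the ResCST architecture is proposed on the basis of the SECOND network. The model introduces residual connections into the 3D sparse convolution layers. It combines the ability of SwinTransformer to capture long-range dependencies with the strength of convolutional neural networks in extracting local features, giving a CNN-SwinTransformer hybrid model that effectively improves feature representation. An RCIoU method is proposed and applied to both the regression and classification branches, so that the classification and regression tasks are optimized jointly. Experimental results show that, for the car category of the KITTI autonomous driving dataset, the model reaches 3D detection accuracies of 91.21%, 82.97%, and 80.28% at the easy, moderate, and hard difficulty levels, respectively. The method clearly improves detection of hard targets and reaches an inference speed of 25 frames per second. The proposed ResCST architecture strikes a good balance between accuracy and speed.

Key words: 3D object detection, point cloud, feature fusion, attention mechanism, vehicle detection, voxelization, autonomous driving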

Abstract:

To address insufficient feature extraction and the inconsistency between classification and regression in point-cloud-based 3D object detection, this paper proposes ResCST, an architecture built on the SECOND network. The model adds residual connections to the 3D sparse convolutional layers and combines the ability of SwinTransformer to capture long-range dependencies with the strength of convolutional neural networks in extracting local features, yielding a CNN-SwinTransformer hybrid model with stronger feature representation. It also introduces the RCIoU method, applied to both the regression and classification branches, to optimize the two tasks jointly. Experiments on the car category of the KITTI autonomous driving dataset show 3D detection accuracies of 91.21%, 82.97%, and 80.28% at the easy, moderate, and hard difficulty levels, respectively. The method markedly improves detection of hard targets and reaches an inference speed of 25 frames per second. Overall, the ResCST architecture achieves a good balance between accuracy and speed.
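
The abstract names an IoU-based method (RCIoU) but does not define it here, so the following is only background: a minimal sketch of plain 3D intersection over union for two axis-aligned boxes, the quantity that IoU-style losses and quality scores generally build on. `Box3D`, `overlap_1d`, and `iou_3d` are hypothetical names introduced for illustration, and the axis-aligned assumption is a simplification, since KITTI boxes also carry a yaw angle. It is not the paper's RCIoU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box3D:
    """Axis-aligned 3D box given by its min and max corners (hypothetical type)."""
    x_min: float
    y_min: float
    z_min: float
    x_max: float
    y_max: float
    z_max: float

    def volume(self) -> float:
        # Clamp each extent at zero so a degenerate box has zero volume.
        return (
            max(0.0, self.x_max - self.x_min)
            * max(0.0, self.y_max - self.y_min)
            * max(0.0, self.z_max - self.z_min)
        )


def overlap_1d(a_min: float, a_max: float, b_min: float, b_max: float) -> float:
    """Length of the overlap of two intervals, or 0 if they are disjoint."""
    return max(0.0, min(a_max, b_max) - max(a_min, b_min))


def iou_3d(a: Box3D, b: Box3D) -> float:
    """Plain IoU of two axis-aligned 3D boxes (assumption: no rotation)."""
    inter = (
        overlap_1d(a.x_min, a.x_max, b.x_min, b.x_max)
        * overlap_1d(a.y_min, a.y_max, b.y_min, b.y_max)
        * overlap_1d(a.z_min, a.z_max, b.z_min, b.z_max)
    )
    union = a.volume() + b.volume() - inter
    return inter / union if union > 0.0 else 0.0


if __name__ == "__main__":
    predicted = Box3D(0, 0, 0, 2, 2, 2)
    ground_truth = Box3D(1, 1, 1, 3, 3, 3)
    # Intersection volume is 1 and union is 8 + 8 - 1 = 15.
    print(iou_3d(predicted, ground_truth))
```

A score of this kind can serve both as a regression objective and as a localization-quality signal for the classification branch, which is one common way of coupling the two tasks; whether RCIoU works this way is not stated in the abstract.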

Key words: 3D object detection, point cloud, feature fusion, attention mechanism, vehicle detection, voxelization, autonomous driving

CLC number: