Object Detection of Lightweight Transformer Based on Knowledge Distillation

doi:10.16182/j.issn1004731x.joss.24-0754

Abstract

In autonomous driving, both the efficiency and the accuracy of object detection are critical. Transformer-based object detection has gradually become the mainstream approach, as it eliminates complex anchor generation and non-maximum suppression (NMS), but it suffers from high computational cost and slow convergence. To address these problems, an object detection model based on a lightweight pooling transformer (LPT) is designed, which comprises a pooling backbone network and a dual pooling attention mechanism. A general knowledge distillation method is also designed for DETR (detection transformer) models; it transfers the teacher's prediction results, query vectors, and extracted features as knowledge to the LPT model to improve its accuracy. To verify the application potential of the distilled LPT model in autonomous driving, extensive experiments are conducted on the MS COCO 2017 dataset. The results show that the distilled LPT reaches 48.3 AP at about 22 frames/s, which is competitive with advanced detectors such as RT-DETR (47.0 AP at 19.99 frames/s).
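
The three kinds of knowledge named in the abstract (prediction results, query vectors, and features) can be pictured as three terms of one weighted loss. The following is only a minimal sketch under stated assumptions, not the authors' formulation: the dictionary keys, the weights `w_pred`, `w_query`, `w_feat`, the temperature, and the choice of KL divergence and mean squared error are all hypothetical, and student and teacher queries are assumed to be already aligned one-to-one.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distillation_loss(student, teacher, w_pred=1.0, w_query=1.0, w_feat=1.0,
                      temperature=2.0):
    """Weighted sum of three illustrative distillation terms.

    `student` and `teacher` are dicts with the (hypothetical) keys
    "logits", "query", and "feature", each holding a flat list of floats.
    """
    # Prediction knowledge: match softened class distributions.
    p_teacher = softmax(teacher["logits"], temperature)
    p_student = softmax(student["logits"], temperature)
    loss_pred = kl_divergence(p_teacher, p_student)
    # Query knowledge: pull the student's query vector toward the teacher's.
    loss_query = mse(student["query"], teacher["query"])
    # Feature knowledge: imitate the teacher's extracted features.
    loss_feat = mse(student["feature"], teacher["feature"])
    return w_pred * loss_pred + w_query * loss_query + w_feat * loss_feat

teacher_out = {"logits": [2.0, 0.5, -1.0], "query": [0.1, 0.2], "feature": [1.0, 0.0, -1.0]}
student_out = {"logits": [0.0, 0.0, 0.0], "query": [0.0, 0.0], "feature": [0.0, 0.0, 0.0]}
print(distillation_loss(student_out, teacher_out))
```

In this sketch the loss is zero when the student reproduces the teacher exactly and grows as any of the three outputs drifts away, which is the behaviour a distillation objective of this kind relies on.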

Key words: object detection, knowledge distillation, lightweight, DETR (detection Transformer), Transformer, autonomous driving

CLC Number: TP391.9

Wang Gaihua, Li Kehong, Long Qian, Yao Jingxuan, Zhu Bolun, Zhou Zhengshu, Pan Xuran. Object Detection of Lightweight Transformer Based on Knowledge Distillation[J]. Journal of System Simulation, 2024, 36(11): 2517-2527.

Figures/Tables (10): Fig. 1-7, Tables 1-3


Model	Pooling Backbone	HPT	Params/M	Weight size/MB	Frame rate/(frames/s)	AP
RT-DETR	-	-	42.944	164	19.99	47.0
LPTv1	√	-	40.136	153	21.13	45.8
LPTv2	-	√	44.622	170	20.92	46.2
LPTv3	√	√	41.814	159	22.05	45.7

Model	Role	Backbone	AP	AP_S	AP_M	AP_L
Deformable DETR	Teacher	ResNet-101	45.5	27.5	48.7	60.3
	Student (not distilled)	ResNet-50	44.1	27.0	47.4	58.3
	Student (distilled)	ResNet-50	46.6	28.5	48.6	61.0
Conditional DETR	Teacher	ResNet-101	42.4	22.6	46.0	61.2
	Student (not distilled)	ResNet-50	40.7	20.3	43.8	60.0
	Student (distilled)	ResNet-50	42.9	21.6	46.5	62.2
LPT	Teacher	HGNetV2	48.1	29.3	51.9	66.4
	Student (not distilled)	Pooling backbone	45.7	27.8	49.0	63.8
	Student (distilled)	Pooling backbone	48.3	28.9	49.7	65.5
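
To make the effect of distillation in the table above easier to read, the short calculation below derives, for each detector, the AP gain of the distilled student over the undistilled one and its margin over the teacher. The AP values are copied from the table; the function and variable names are illustrative only.

```python
# AP values copied from the table above: (teacher, student, distilled student).
results = {
    "Deformable DETR": (45.5, 44.1, 46.6),
    "Conditional DETR": (42.4, 40.7, 42.9),
    "LPT": (48.1, 45.7, 48.3),
}

def distillation_summary(table):
    """Return the AP gain from distillation and the margin over the teacher."""
    summary = {}
    for name, (teacher, student, distilled) in table.items():
        summary[name] = {
            "gain": round(distilled - student, 1),
            "vs_teacher": round(distilled - teacher, 1),
        }
    return summary

summary = distillation_summary(results)
for name, row in summary.items():
    print(f"{name}: {row['gain']:+} AP from distillation, "
          f"{row['vs_teacher']:+} AP relative to the teacher")
```

In all three cases the distilled student gains more than 2 AP and ends slightly above its own teacher in overall AP.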

Model	Params/M	Computational complexity/G	Frame rate/(frames/s)	AP	AP_50	AP_75	AP_S	AP_M	AP_L
DETR	41.580	86.556	14.80	15.5	29.4	14.5	4.3	15.1	26.7
DAB-DETR	43.722	90.740	10.80	38.0	60.3	39.8	19.2	40.9	55.4
RT-DETR	42.940	69.157	19.99	47.0	64.6	50.8	28.5	51.1	65.2
Ours (not distilled)	41.814	60.949	22.02	45.7	63.4	48.9	27.8	49.0	63.8
Ours (distilled)	41.814	60.949	22.02	48.3	64.4	51.2	28.9	49.7	65.5
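
As a rough reading aid for the comparison table above, the snippet below expresses the distilled model's cost and speed relative to RT-DETR as percentages. The numbers are copied from the two corresponding rows; the dictionary keys and the helper `relative_change` are hypothetical names introduced only for this illustration.

```python
# Rows copied from the comparison table above.
rt_detr = {"params_m": 42.940, "complexity_g": 69.157, "fps": 19.99, "ap": 47.0}
lpt_distilled = {"params_m": 41.814, "complexity_g": 60.949, "fps": 22.02, "ap": 48.3}

def relative_change(new, old):
    """Signed change of `new` relative to `old`, in percent."""
    return 100.0 * (new - old) / old

changes = {
    key: relative_change(lpt_distilled[key], rt_detr[key])
    for key in ("params_m", "complexity_g", "fps")
}
ap_delta = lpt_distilled["ap"] - rt_detr["ap"]

for key, value in changes.items():
    print(f"{key}: {value:+.2f} %")
print(f"AP difference: {ap_delta:+.1f}")
```

Under these figures the distilled model has roughly 12% lower computational complexity and about 10% higher frame rate than RT-DETR, with slightly fewer parameters and a higher AP.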
