Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88102
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 陳文進 | zh_TW |
dc.contributor.advisor | Wen-Chin Chen | en |
dc.contributor.author | 曾靖渝 | zh_TW |
dc.contributor.author | Ching-Yu Tseng | en |
dc.date.accessioned | 2023-08-08T16:18:33Z | - |
dc.date.available | 2023-11-09 | - |
dc.date.copyright | 2023-08-08 | - |
dc.date.issued | 2023 | - |
dc.date.submitted | 2023-07-14 | - |
dc.identifier.citation | [1] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020.
[2] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
[3] W. Chang, Y. Zhang, and Z. Xiong. Transformer-based monocular depth estimation with attention supervision. 2021.
[4] Y. Chen, S. Liu, X. Shen, and J. Jia. DSGN: Deep stereo geometry network for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12536–12545, 2020.
[5] Z. Chen, Z. Li, S. Zhang, L. Fang, Q. Jiang, and F. Zhao. Graph-DETR3D: Rethinking overlapping regions for multi-view 3D object detection. arXiv preprint arXiv:2204.11582, 2022.
[6] M. Ding, Y. Huo, H. Yi, Z. Wang, J. Shi, Z. Lu, and P. Luo. Learning depth-guided convolutions for monocular 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 1000–1001, 2020.
[7] V. Guizilini, R. Ambrus, S. Pillai, A. Raventos, and A. Gaidon. 3D packing for self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2485–2494, 2020.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[9] A. Hu, Z. Murez, N. Mohan, S. Dudas, J. Hawke, V. Badrinarayanan, R. Cipolla, and A. Kendall. FIERY: Future instance prediction in bird's-eye view from surround monocular cameras. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15273–15282, 2021.
[10] J. Huang and G. Huang. BEVDet4D: Exploit temporal cues in multi-camera 3D object detection. arXiv preprint arXiv:2203.17054, 2022.
[11] J. Huang, G. Huang, Z. Zhu, and D. Du. BEVDet: High-performance multi-camera 3D object detection in bird-eye view. arXiv preprint arXiv:2112.11790, 2021.
[12] K.-C. Huang, T.-H. Wu, H.-T. Su, and W. H. Hsu. MonoDTR: Monocular 3D object detection with depth-aware transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4012–4021, 2022.
[13] A. Kumar, G. Brazil, E. Corona, A. Parchami, and X. Liu. DEVIANT: Depth equivariant network for monocular 3D object detection. arXiv preprint arXiv:2207.10758, 2022.
[14] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom. PointPillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12697–12705, 2019.
[15] Y. Lee, J.-w. Hwang, S. Lee, Y. Bae, and J. Park. An energy and GPU-computation efficient backbone network for real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.
[16] Y. Lee and J. Park. CenterMask: Real-time anchor-free instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13906–13915, 2020.
[17] P. Li, X. Chen, and S. Shen. Stereo R-CNN based 3D object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7644–7652, 2019.
[18] Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Q. Yu, and J. Dai. BEVFormer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers. arXiv preprint arXiv:2203.17270, 2022.
[19] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
[20] Y. Liu, T. Wang, X. Zhang, and J. Sun. PETR: Position embedding transformation for multi-view 3D object detection. arXiv preprint arXiv:2203.05625, 2022.
[21] Y. Liu, J. Yan, F. Jia, S. Li, Q. Gao, T. Wang, X. Zhang, and J. Sun. PETRv2: A unified framework for 3D perception from multi-camera images. arXiv preprint arXiv:2206.01256, 2022.
[22] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
[23] C. Lu, M. J. G. van de Molengraft, and G. Dubbelman. Monocular semantic occupancy grid mapping with convolutional variational encoder–decoder networks. IEEE Robotics and Automation Letters, 4(2):445–452, 2019.
[24] J. Mao, S. Shi, X. Wang, and H. Li. 3D object detection for autonomous driving: A review and new outlooks. arXiv preprint arXiv:2206.09474, 2022.
[25] B. Pan, J. Sun, H. Y. T. Leung, A. Andonian, and B. Zhou. Cross-view semantic segmentation for sensing surroundings. IEEE Robotics and Automation Letters, 5(3):4867–4873, 2020.
[26] D. Park, R. Ambrus, V. Guizilini, J. Li, and A. Gaidon. Is pseudo-LiDAR needed for monocular 3D object detection? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3142–3152, 2021.
[27] L. Peng, Z. Chen, Z. Fu, P. Liang, and E. Cheng. BEVSegFormer: Bird's-eye-view semantic segmentation from arbitrary camera rigs. arXiv preprint arXiv:2203.04050, 2022.
[28] J. Philion and S. Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3D. In European Conference on Computer Vision, pages 194–210. Springer, 2020.
[29] C. Reading, A. Harakeh, J. Chae, and S. L. Waslander. Categorical depth distribution network for monocular 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8555–8564, 2021.
[30] T. Roddick, A. Kendall, and R. Cipolla. Orthographic feature transforms for monocular 3D object detection. arXiv preprint arXiv:1811.08188, 2018.
[31] D. Rukhovich, A. Vorontsova, and A. Konushin. ImVoxelNet: Image to voxels projection for monocular and multi-view general-purpose 3D object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2397–2406, 2022.
[32] A. Saha, O. Mendez, C. Russell, and R. Bowden. Translating images into maps. In 2022 International Conference on Robotics and Automation (ICRA), pages 9200–9206. IEEE, 2022.
[33] Y. Tang, S. Dorn, and C. Savani. Center3D: Center-based monocular 3D object detection with joint depth understanding. In DAGM German Conference on Pattern Recognition, pages 289–302. Springer, 2020.
[34] Z. Tian, C. Shen, H. Chen, and T. He. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9627–9636, 2019.
[35] T. Wang, Z. Xinge, J. Pang, and D. Lin. Probabilistic and geometric depth: Detecting objects in perspective. In Conference on Robot Learning, pages 1475–1485. PMLR, 2022.
[36] T. Wang, X. Zhu, J. Pang, and D. Lin. FCOS3D: Fully convolutional one-stage monocular 3D object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 913–922, 2021.
[37] Y. Wang, V. C. Guizilini, T. Zhang, Y. Wang, H. Zhao, and J. Solomon. DETR3D: 3D object detection from multi-view images via 3D-to-2D queries. In Conference on Robot Learning, pages 180–191. PMLR, 2022.
[38] Y. Wang, X. Zhang, T. Yang, and J. Sun. Anchor DETR: Query design for transformer-based detector. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 2567–2575, 2022.
[39] X. Weng and K. Kitani. Monocular 3D object detection with pseudo-LiDAR point cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0, 2019.
[40] Y. Yan, Y. Mao, and B. Li. SECOND: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018.
[41] T. Yin, X. Zhou, and P. Krähenbühl. Center-based 3D object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11784–11793, 2021.
[42] Y. You, Y. Wang, W.-L. Chao, D. Garg, G. Pleiss, B. Hariharan, M. Campbell, and K. Q. Weinberger. Pseudo-LiDAR++: Accurate depth for 3D object detection in autonomous driving. arXiv preprint arXiv:1906.06310, 2019.
[43] R. Zhang, H. Qiu, T. Wang, X. Xu, Z. Guo, Y. Qiao, P. Gao, and H. Li. MonoDETR: Depth-aware transformer for monocular 3D object detection. arXiv preprint arXiv:2203.13310, 2022.
[44] Y. Zhang, X. Ma, S. Yi, J. Hou, Z. Wang, W. Ouyang, and D. Xu. Learning geometry-guided depth via projective modeling for monocular 3D object detection. arXiv preprint arXiv:2107.13931, 2021.
[45] Y. Zhang, Z. Zhu, W. Zheng, J. Huang, G. Huang, J. Zhou, and J. Lu. BEVerse: Unified perception and prediction in bird's-eye view for vision-centric autonomous driving. arXiv preprint arXiv:2205.09743, 2022.
[46] B. Zhou and P. Krähenbühl. Cross-view transformers for real-time map-view semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13760–13769, 2022.
[47] X. Zhou, D. Wang, and P. Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
[48] B. Zhu, Z. Jiang, X. Zhou, Z. Li, and G. Yu. Class-balanced grouping and sampling for point cloud 3D object detection. arXiv preprint arXiv:1908.09492, 2019.
[49] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020. | - |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88102 | - |
dc.description.abstract | 為了在自駕車中以低成本實現準確的三維物體檢測,許多多目相機方法被提出來解決單相機方法中的遮擋問題。然而,由於缺乏準確的深度估計,現有的多目相機方法通常會在深度方向的射線上因難以檢測的小型物體(如行人)而預測多個邊界框,導致召回率極低。此外,直接將通常由大型網絡結構組成的深度預測模塊應用於現有的多目相機方法,無法滿足自駕車應用的即時預測要求。為了解決這些問題,我們提出了用於深度引導跨視角多目相機三維物體檢測(CrossDTR)。首先,我們設計了輕量級的「深度預測器」,以在監督過程中生成精確的物體稀疏深度圖和低維深度嵌入向量,而無需額外的深度數據集來監督。其次,我們開發了一個「深度引導跨視角多目變換器」,用於融合來自不同相機視角的深度嵌入和影像特徵,並生成三維邊界框。廣泛的實驗表明,我們的方法在行人檢測方面總共超過現有的多目相機方法10%,在整體平均精度(mAP)和標準化檢測得分(NDS)指標方面超過約3%。此外,計算分析顯示,我們的方法比先前的方法快5倍。我們的代碼將在https://github.com/sty61010/CrossDTR 公開提供。 | zh_TW |
dc.description.abstract | To achieve accurate 3D object detection at low cost for autonomous driving, many multi-camera methods have been proposed to address the occlusion problem of monocular approaches. However, lacking accurate depth estimates, existing multi-camera methods often generate multiple bounding boxes along a ray in the depth direction for hard-to-detect small objects such as pedestrians, resulting in extremely low recall. Furthermore, directly attaching depth prediction modules, which are generally built on large network architectures, to existing multi-camera methods cannot meet the real-time requirements of self-driving applications. To address these issues, we propose CrossDTR, Cross-view and Depth-guided Transformers for 3D Object Detection. First, our lightweight *depth predictor* is designed to produce precise object-wise sparse depth maps and low-dimensional depth embeddings without requiring extra depth datasets for supervision. Second, a *cross-view depth-guided transformer* is developed to fuse the depth embeddings with image features from cameras of different views and to generate 3D bounding boxes. Extensive experiments demonstrate that our method surpasses existing multi-camera methods by 10 percent in pedestrian detection and by about 3 percent in the overall mAP and NDS metrics. Computational analyses also show that our method is 5 times faster than prior approaches. Our code will be made publicly available at https://github.com/sty61010/CrossDTR. | en |
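The abstract above describes a two-stage design: a lightweight depth predictor that emits object-wise sparse depth maps plus low-dimensional depth embeddings, and a cross-view depth-guided transformer that fuses those embeddings with image features from multiple camera views. The following is a purely illustrative, dependency-free sketch of that data flow only; every function name, shape, threshold, and operation is a hypothetical stand-in (toy arithmetic replaces the actual networks), not the authors' implementation, which lives in the linked repository.

```python
# Hypothetical sketch of the CrossDTR-style data flow described in the
# abstract. All names, dimensions, and operations are illustrative only.

def depth_predictor(image_feat, embed_dim=16):
    """Stand-in for the lightweight depth predictor: from one camera's
    feature map, produce (a) an object-wise *sparse* depth map and
    (b) a low-dimensional depth embedding."""
    # Sparse depth map: keep depth only where the feature response is
    # high (a toy proxy for "object-wise" sparsity); zero elsewhere.
    sparse_depth = [[cell * 50.0 if cell > 0.8 else 0.0 for cell in row]
                    for row in image_feat]
    # Low-dimensional depth embedding: here just pooled statistics,
    # standing in for a learned projection.
    flat = [c for row in image_feat for c in row]
    mean = sum(flat) / len(flat)
    depth_embedding = [mean * (i + 1) / embed_dim for i in range(embed_dim)]
    return sparse_depth, depth_embedding

def cross_view_fusion(per_camera_feats, per_camera_depth_embs, num_queries=4):
    """Stand-in for the cross-view depth-guided transformer: each object
    query aggregates over all camera views, with per-view weights derived
    from the depth embeddings, then decodes a 3D box."""
    boxes = []
    for q in range(num_queries):
        # Toy "attention": weight each view by a summary of its embedding.
        weights = [sum(emb) / len(emb) for emb in per_camera_depth_embs]
        total = sum(weights) or 1.0
        fused = sum(w / total * feats[0][0]
                    for w, feats in zip(weights, per_camera_feats))
        # Decode an (x, y, z, w, l, h, yaw)-style box from the fused value.
        boxes.append([fused * q, fused, fused * 0.5, 1.8, 4.5, 1.6, 0.0])
    return boxes
```

A usage sketch with six surround cameras (as in nuScenes) would run `depth_predictor` per view, then hand all feature maps and embeddings to `cross_view_fusion` to get one box per query; the key point mirrored from the abstract is that depth guidance enters the fusion step as a compact embedding rather than a dense depth volume.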
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-08-08T16:18:33Z No. of bitstreams: 0 | en |
dc.description.provenance | Made available in DSpace on 2023-08-08T16:18:33Z (GMT). No. of bitstreams: 0 | en |
dc.description.tableofcontents | Verification Letter from the Oral Examination Committee i
摘要 iii
Abstract v
Contents vii
List of Figures ix
List of Tables xi
Chapter 1 Introduction 1
Chapter 2 Related Work 5
2.0.1 Monocular 3D Object Detection 5
2.0.2 Multi-Camera 3D Object Detection 6
2.0.3 Depth-guided Monocular Methods 6
Chapter 3 Method 9
3.0.1 Problem Definition 9
3.0.2 Overall Architecture 9
3.0.3 Object-wise Sparse Depth Map 10
3.0.4 Depth Predictor 11
3.0.5 Cross-view and Depth-guided Transformer 12
3.0.6 3D Detection Head and Loss 13
Chapter 4 Experiments 15
4.0.1 Setup 15
4.0.2 Quantitative Results 17
4.0.3 Ablation Study 18
4.0.4 False Positive Predictions Results 18
4.0.5 Qualitative Results 19
Chapter 5 Conclusion 21
References 23 | - |
dc.language.iso | en | - |
dc.title | 深度引導跨視角多目相機三維物體檢測 | zh_TW |
dc.title | CrossDTR: Cross-view and Depth-guided Transformers for 3D Object Detection | en |
dc.type | Thesis | - |
dc.date.schoolyear | 111-2 | - |
dc.description.degree | Master's | - |
dc.contributor.coadvisor | 徐宏民 | zh_TW |
dc.contributor.coadvisor | Winston H. Hsu | en |
dc.contributor.oralexamcommittee | 陳駿丞;陳奕廷;葉梅珍 | zh_TW |
dc.contributor.oralexamcommittee | Jun-Cheng Chen;Yi-Ting Chen;Mei-Chen Yeh | en |
dc.subject.keyword | 電腦視覺,自駕車,物件偵測 | zh_TW |
dc.subject.keyword | Computer Vision, Autonomous Driving, Object Detection | en |
dc.relation.page | 29 | - |
dc.identifier.doi | 10.6342/NTU202301047 | - |
dc.rights.note | Authorized (worldwide public access) | - |
dc.date.accepted | 2023-07-17 | - |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
dc.contributor.author-dept | Department of Computer Science and Information Engineering | - |
Appears in Collections: | Department of Computer Science and Information Engineering
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-111-2.pdf | 5.84 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.