NTU Theses and Dissertations Repository

Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88102
Full metadata record
dc.contributor.advisor: 陳文進 [zh_TW]
dc.contributor.advisor: Wen-Chin Chen [en]
dc.contributor.author: 曾靖渝 [zh_TW]
dc.contributor.author: Ching-Yu Tseng [en]
dc.date.accessioned: 2023-08-08T16:18:33Z
dc.date.available: 2023-11-09
dc.date.copyright: 2023-08-08
dc.date.issued: 2023
dc.date.submitted: 2023-07-14
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88102
dc.description.abstract: 為了在自駕車中以低成本實現準確的三維物體檢測,許多多目相機方法被提出來解決單相機方法中的遮擋問題。然而,由於缺乏準確的深度估計,現有的多目相機方法通常會在深度方向的射線上因難以檢測的小型物體(如行人)而預測多個邊界框,導致召回率極低。此外,直接將通常由大型網絡結構組成的深度預測模塊應用於現有的多目相機方法,無法滿足自駕車應用的即時預測要求。為了解決這些問題,我們提出了深度引導跨視角多目相機三維物體檢測(CrossDTR)。首先,我們設計了輕量級的「深度預測器」,以在監督過程中生成精確的物體稀疏深度圖和低維深度嵌入向量,而無需額外的深度數據集來監督。其次,我們開發了一個「深度引導跨視角多目變換器」,用於融合來自不同相機視角的深度嵌入和影像特徵,並生成三維邊界框。廣泛的實驗表明,我們的方法在行人檢測方面總共超過現有的多目相機方法10%,在整體平均精度(mAP)和標準化檢測得分(NDS)指標方面超過約3%。此外,計算分析顯示,我們的方法比先前的方法快5倍。我們的代碼將在https://github.com/sty61010/CrossDTR 公開提供。 [zh_TW]
dc.description.abstract: To achieve accurate 3D object detection at a low cost for autonomous driving, many multi-camera methods have been proposed to overcome the occlusion problem of monocular approaches. However, due to the lack of accurate depth estimates, existing multi-camera methods often generate multiple bounding boxes along a ray in the depth direction for difficult small objects such as pedestrians, resulting in an extremely low recall. Furthermore, directly attaching depth prediction modules, which generally consist of large network architectures, to existing multi-camera methods cannot meet the real-time requirements of self-driving applications. To address these issues, we propose Cross-view and Depth-guided Transformers for 3D Object Detection (CrossDTR). First, our lightweight depth predictor is designed to produce precise object-wise sparse depth maps and low-dimensional depth embeddings without extra depth datasets during supervision. Second, a cross-view depth-guided transformer is developed to fuse the depth embeddings with image features from cameras of different views and to generate 3D bounding boxes. Extensive experiments demonstrate that our method surpasses existing multi-camera methods by 10 percent in pedestrian detection and by about 3 percent in the overall mAP and NDS metrics. Computational analyses also show that our method is 5 times faster than prior approaches. Our code will be made publicly available at https://github.com/sty61010/CrossDTR. [en]
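The two components named in this abstract (a lightweight depth predictor whose output guides a cross-view transformer decoder) can be made concrete with a short sketch. The following PyTorch code is a minimal illustration by the editor, assuming plausible module names, tensor shapes, and hyperparameters; it is not the authors' released implementation, which is to appear at https://github.com/sty61010/CrossDTR.

# Illustrative sketch only. All class names, shapes, and hyperparameters
# below are editorial assumptions, not the authors' released code.
import torch
import torch.nn as nn

class DepthPredictor(nn.Module):
    # Lightweight head: per-camera image features are mapped to depth-bin
    # logits (which could be supervised with object-wise sparse depth maps)
    # and squeezed into a low-dimensional depth embedding.
    def __init__(self, feat_dim=256, depth_bins=64, embed_dim=32):
        super().__init__()
        self.depth_head = nn.Conv2d(feat_dim, depth_bins, kernel_size=1)
        self.embed_head = nn.Conv2d(depth_bins, embed_dim, kernel_size=1)

    def forward(self, feats):                       # (B*N, C, H, W)
        depth_logits = self.depth_head(feats)       # (B*N, D, H, W)
        depth_probs = depth_logits.softmax(dim=1)   # per-pixel bin distribution
        depth_embed = self.embed_head(depth_probs)  # (B*N, E, H, W)
        return depth_logits, depth_embed

class CrossViewDepthGuidedLayer(nn.Module):
    # One decoder layer: 3D object queries cross-attend to image tokens
    # from all camera views after fusing them with the depth embeddings.
    def __init__(self, d_model=256, embed_dim=32, n_heads=8):
        super().__init__()
        self.fuse = nn.Linear(d_model + embed_dim, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, queries, img_tokens, depth_tokens):
        # queries: (B, Q, d_model); tokens: (B, N*H*W, d_model / embed_dim)
        kv = self.fuse(torch.cat([img_tokens, depth_tokens], dim=-1))
        attn_out, _ = self.cross_attn(queries, kv, kv)
        queries = queries + attn_out
        return queries + self.ffn(queries)

# Toy shape check: 2 samples, 6 cameras with 20x50 feature maps, 900 queries.
layer = CrossViewDepthGuidedLayer()
q = torch.randn(2, 900, 256)
img = torch.randn(2, 6 * 20 * 50, 256)
dep = torch.randn(2, 6 * 20 * 50, 32)
print(layer(q, img, dep).shape)  # torch.Size([2, 900, 256])

In the thesis's actual pipeline, a stack of such decoder layers would feed a 3D detection head that regresses the bounding boxes; the sketch above only fixes the data flow that the abstract describes.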
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-08-08T16:18:33Z. No. of bitstreams: 0 [en]
dc.description.provenance: Made available in DSpace on 2023-08-08T16:18:33Z (GMT). No. of bitstreams: 0 [en]
dc.description.tableofcontents: Verification Letter from the Oral Examination Committee i
摘要 (Chinese abstract) iii
Abstract v
Contents vii
List of Figures ix
List of Tables xi
Chapter 1 Introduction 1
Chapter 2 Related Work 5
2.0.1 Monocular 3D Object Detection 5
2.0.2 Multi-Camera 3D Object Detection 6
2.0.3 Depth-guided Monocular Methods 6
Chapter 3 Method 9
3.0.1 Problem Definition 9
3.0.2 Overall Architecture 9
3.0.3 Object-wise Sparse Depth Map 10
3.0.4 Depth Predictor 11
3.0.5 Cross-view and Depth-guided Transformer 12
3.0.6 3D Detection Head and Loss 13
Chapter 4 Experiments 15
4.0.1 Setup 15
4.0.2 Quantitative Results 17
4.0.3 Ablation Study 18
4.0.4 False Positive Prediction Results 18
4.0.5 Qualitative Results 19
Chapter 5 Conclusion 21
References 23
dc.language.iso: en
dc.subject: 電腦視覺 [zh_TW]
dc.subject: 自駕車 [zh_TW]
dc.subject: 物件偵測 [zh_TW]
dc.subject: Object Detection [en]
dc.subject: Computer Vision [en]
dc.subject: Autonomous Driving [en]
dc.title: 深度引導跨視角多目相機三維物體檢測 [zh_TW]
dc.title: CrossDTR: Cross-view and Depth-guided Transformers for 3D Object Detection [en]
dc.type: Thesis
dc.date.schoolyear: 111-2
dc.description.degree: 碩士 (Master's)
dc.contributor.coadvisor: 徐宏民 [zh_TW]
dc.contributor.coadvisor: Winston H. Hsu [en]
dc.contributor.oralexamcommittee: 陳駿丞;陳奕廷;葉梅珍 [zh_TW]
dc.contributor.oralexamcommittee: Jun-Cheng Chen;Yi-Ting Chen;Mei-Chen Yeh [en]
dc.subject.keyword: 電腦視覺, 自駕車, 物件偵測 [zh_TW]
dc.subject.keyword: Computer Vision, Autonomous Driving, Object Detection [en]
dc.relation.page: 29
dc.identifier.doi: 10.6342/NTU202301047
dc.rights.note: 同意授權(全球公開) (authorization granted; open access worldwide)
dc.date.accepted: 2023-07-17
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 資訊工程學系 (Department of Computer Science and Information Engineering)
Appears in Collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in This Item:
ntu-111-2.pdf (5.84 MB, Adobe PDF)


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
