NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/8215

Full metadata record (DC field | value | language):
dc.contributor.advisor: 洪一平 (Yi-Ping Hung)
dc.contributor.author: Yu-Lin Tsai [en]
dc.contributor.author: 蔡侑霖 [zh_TW]
dc.date.accessioned: 2021-05-20T00:50:12Z
dc.date.available: 2020-08-24
dc.date.available: 2021-05-20T00:50:12Z
dc.date.copyright: 2020-08-24
dc.date.issued: 2020
dc.date.submitted: 2020-08-14
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/8215
dc.description.abstract: Image-based localization is the problem of estimating a camera's own pose from image information; it is a key, foundational technology for autonomous driving, augmented reality, and intelligent robots. In recent years, with growing computing power and advances in deep learning, many studies have tried to exploit the strong feature-description capability of convolutional networks to aid camera localization. However, when the deployment scene changes, these methods must spend considerable effort and time retraining their models, showing that their generalization ability is quite limited. Localization frameworks based on image retrieval improve generalization across scenes, but remain constrained by the scene when predicting the relative camera pose. Following the image-retrieval concept, we propose a camera-localization framework that incorporates more classical spatial geometry when computing relative camera poses. We also use deep learning to predict image depth, which further improves our localization accuracy. Experiments show that our method matches the localization accuracy of current state-of-the-art approaches; in addition, model compression lets our pipeline run in near real time. We therefore consider the fusion of classical camera geometry and deep learning a promising direction. [zh_TW]
dc.description.abstract: Image-based localization estimates camera poses within a specific scene coordinate system, a fundamental technology for augmented reality, autonomous driving, and mobile robotics. With the advancement of deep learning, end-to-end approaches based on convolutional neural networks have been well developed. However, these methods suffer from the overhead of retraining their models when applied to unseen scenes. Image retrieval-based localization approaches have therefore been proposed for their generalization capability. In this thesis, we follow the image retrieval-based paradigm and adopt traditional geometric computation for relative pose estimation. We also use depth information predicted by deep learning methods to enhance localization performance. Experimental results on an indoor dataset show state-of-the-art accuracy. Furthermore, by distilling and sharing the encoder for global and local features, we make our system suitable for real-time applications. Our method shows great potential in combining traditional geometric knowledge with deep learning methods. [en]
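The pipeline the abstract describes retrieves a database image with a known pose, matches local features, recovers the query's pose relative to it, and, in the 2D-3D case, lifts matched keypoints into 3D using predicted depth. Below is a minimal NumPy sketch of the two geometric building blocks this relies on, back-projection with depth and pose chaining; the intrinsics, helper names (`backproject`, `chain_pose`), and sample values are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

def backproject(uv, depth, K):
    """Lift 2D pixels to 3D camera-frame points: X = d * K^-1 [u, v, 1]^T."""
    uv1 = np.column_stack([uv, np.ones(len(uv))])   # homogeneous pixels, (N, 3)
    rays = (np.linalg.inv(K) @ uv1.T).T             # viewing rays, (N, 3)
    return rays * depth[:, None]                    # scale each ray by its depth

def chain_pose(T_db, T_rel):
    """Absolute query pose from the retrieved image's pose and the relative pose (4x4)."""
    return T_db @ T_rel

# Toy intrinsics (illustrative values, not the thesis's calibration).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Two matched keypoints with predicted depths.
uv = np.array([[320.0, 240.0], [420.0, 240.0]])
d = np.array([2.0, 4.0])
pts3d = backproject(uv, d, K)
# The principal-point pixel lifts straight along the optical axis: pts3d[0] == [0, 0, 2].
```

Given the 3D points, the 2D-3D case would estimate the query pose with a PnP solver inside RANSAC, then compose it onto the retrieved image's absolute pose via `chain_pose`.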
dc.description.provenance: Made available in DSpace on 2021-05-20T00:50:12Z (GMT). No. of bitstreams: 1. U0001-1308202017410300.pdf: 11315996 bytes, checksum: 54aef6834ee65bf379679fb988579353 (MD5). Previous issue date: 2020 [en]
dc.description.tableofcontents:
口試委員會審定書 (Oral Examination Committee Certification) #
誌謝 (Acknowledgements) i
中文摘要 (Chinese Abstract) ii
ABSTRACT iii
CONTENTS iv
LIST OF FIGURES vi
LIST OF TABLES vii
Chapter 1 Introduction 1
Chapter 2 Related Work 4
2.1 6-DoF Visual Localization 4
2.1.1 Structure-based Localization 4
2.1.2 Image Retrieval Localization 4
2.1.3 Absolute Camera Pose Regression 6
2.2 Local Features 6
2.3 Depth Estimation 7
Chapter 3 Method 9
3.1 Pipeline 9
3.2 Image Retrieval 10
3.3 Feature Extraction and Matching 11
3.4 Relative Pose Estimation 12
3.4.1 2D-2D case 12
3.4.2 2D-3D case 16
3.5 Depth Estimation 17
3.5.1 Depth Estimation Framework 17
3.5.2 Loss Term 18
3.6 Model Distillation 20
Chapter 4 Experiments 21
4.1 Dataset 21
4.2 Localization Result 21
4.2.1 Image Retrieval 22
4.2.2 2D-2D case 22
4.2.3 2D-3D case 24
4.3 Computational Cost 26
4.4 Depth Estimation 27
4.5 Generalization 28
Chapter 5 Conclusion 29
Chapter 6 Future Work 30
REFERENCES 31
dc.language.iso: en
dc.title: 基於深度學習方法之相機定位 [zh_TW]
dc.title: Camera Re-Localization with Deep Learning Methods [en]
dc.type: Thesis
dc.date.schoolyear: 108-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 莊仁輝 (Jen-Hui Chuang), 郭景明 (Jing-Ming Guo), 歐陽明 (Ming Ouhyoung), 李明穗 (Ming-Sui Lee)
dc.subject.keyword: 基於影像的定位, 相機定位, 深度學習, 擴增實境 [zh_TW]
dc.subject.keyword: Image-based localization, Camera pose estimation, Deep learning, Augmented Reality [en]
dc.relation.page: 36
dc.identifier.doi: 10.6342/NTU202003304
dc.rights.note: 同意授權 (authorized, open access worldwide)
dc.date.accepted: 2020-08-14
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) [zh_TW]
dc.contributor.author-dept: 資訊網路與多媒體研究所 (Graduate Institute of Networking and Multimedia) [zh_TW]
Appears in collections: 資訊網路與多媒體研究所 (Graduate Institute of Networking and Multimedia)

Files in this item:
File: U0001-1308202017410300.pdf | Size: 11.05 MB | Format: Adobe PDF


All items in the system are protected by copyright, with all rights reserved, unless otherwise indicated in their license terms.
