Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91444
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 丁建均 | zh_TW |
dc.contributor.advisor | Jian-Jiun Ding | en |
dc.contributor.author | 鄭閔軒 | zh_TW |
dc.contributor.author | Min-Hsuan Cheng | en |
dc.date.accessioned | 2024-01-26T16:32:08Z | - |
dc.date.available | 2024-01-27 | - |
dc.date.copyright | 2024-01-26 | - |
dc.date.issued | 2024 | - |
dc.date.submitted | 2024-01-18 | - |
dc.identifier.citation | [1] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, pp. 91-110, 2004.
[2] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded Up Robust Features," in Computer Vision -- ECCV 2006, A. Leonardis et al. (Eds.), Springer Berlin Heidelberg, Berlin, Heidelberg, 2006.
[3] Q. Wang, X. Zhou, B. Hariharan, and N. Snavely, "Learning Feature Descriptors using Camera Pose Supervision," in Proc. European Conference on Computer Vision (ECCV), 2020.
[4] K. Li, L. Wang, L. Liu, Q. Ran, K. Xu, and Y. Guo, "Decoupling Makes Weakly Supervised Local Feature Better," in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15817-15827, 2022.
[5] F. Yu and V. Koltun, "Multi-Scale Context Aggregation by Dilated Convolutions," arXiv:1511.07122 [cs.CV], 2016.
[6] C. Wang, R. Xu, Y. Zhang, S. Xu, W. Meng, B. Fan, and X. Zhang, "MTLDesc: Looking Wider to Describe Better," in AAAI, AAAI Press, 2022.
[7] J. Sun, Z. Shen, Y. Wang, H. Bao, and X. Zhou, "LoFTR: Detector-Free Local Feature Matching with Transformers," in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8918-8927, 2021.
[8] M. Tyszkiewicz, P. Fua, and E. Trulls, "DISK: Learning local features with policy gradient," in Advances in Neural Information Processing Systems, vol. 33, H. Larochelle et al. (Eds.), pp. 14254-14265, Curran Associates, Inc., 2020.
[9] V. Balntas, K. Lenc, A. Vedaldi, and K. Mikolajczyk, "HPatches: A Benchmark and Evaluation of Handcrafted and Learned Local Descriptors," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3852-3861, 2017.
[10] S. Zhai, W. Talbott, N. Srivastava, C. Huang, H. Goh, R. Zhang, and J. Susskind, "An Attention-Free Transformer," arXiv:2105.14103 [cs.LG], 2021.
[11] P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, "SuperGlue: Learning Feature Matching With Graph Neural Networks," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4937-4946, 2020.
[12] D. DeTone, T. Malisiewicz, and A. Rabinovich, "SuperPoint: Self-Supervised Interest Point Detection and Description," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 337-33712, 2018.
[13] P.-E. Sarlin, C. Cadena, R. Siegwart, and M. Dymczyk, "From Coarse to Fine: Robust Hierarchical Localization at Large Scale," in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12708-12717, 2019.
[14] Z. Zhang, T. Sattler, and D. Scaramuzza, "Reference Pose Generation for Long-term Visual Localization via Learned Features and View Synthesis," International Journal of Computer Vision, vol. 129, no. 4, pp. 821-844, Dec. 2020.
[15] Z. Li and N. Snavely, "MegaDepth: Learning Single-View Depth Prediction from Internet Photos," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2041-2050, 2018.
[16] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation," in Medical Image Computing and Computer-Assisted Intervention -- MICCAI 2015, N. Navab et al. (Eds.), Springer International Publishing, Cham, 2015, pp. 234-241.
[17] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, 2016.
[18] D. Mishkin, F. Radenovic, and J. Matas, "Repeatability is not enough: Learning affine regions via discriminability," in Proc. European Conference on Computer Vision (ECCV), pp. 284-300, 2018.
[19] M. Dusmanu, I. Rocco, T. Pajdla, M. Pollefeys, J. Sivic, A. Torii, and T. Sattler, "D2-Net: A Trainable CNN for Joint Description and Detection of Local Features," in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8084-8093, 2019.
[20] J. Revaud, C. De Souza, M. Humenberger, and P. Weinzaepfel, "R2D2: Reliable and Repeatable Detector and Descriptor," in Advances in Neural Information Processing Systems, vol. 32, H. Wallach et al. (Eds.), Curran Associates, Inc., 2019.
[21] Z. Luo, L. Zhou, X. Bai, H. Chen, J. Zhang, Y. Yao, S. Li, T. Fang, and L. Quan, "ASLFeat: Learning Local Features of Accurate Shape and Localization," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6588-6597, 2020.
[22] A. Mishchuk, D. Mishkin, F. Radenovic, and J. Matas, "Working hard to know your neighbor's margins: Local descriptor learning loss," in Advances in Neural Information Processing Systems, vol. 30, I. Guyon et al. (Eds.), Curran Associates, Inc., 2017.
[23] R. Arandjelović and A. Zisserman, "Three things everyone should know to improve object retrieval," in 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2911-2918, 2012.
[24] M. A. Fischler and R. C. Bolles, "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography," Communications of the ACM, vol. 24, no. 6, pp. 381-395, June 1981.
[25] L. Chi, B. Jiang, and Y. Mu, "Fast Fourier Convolution," in Advances in Neural Information Processing Systems, vol. 33, H. Larochelle et al. (Eds.), Curran Associates, Inc., 2020, pp. 4479-4488.
[26] J.-B. Cordonnier, A. Loukas, and M. Jaggi, "Multi-Head Attention: Collaborate Instead of Concatenate," arXiv:2006.16362 [cs.LG], 2020. | - |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91444 | - |
dc.description.abstract | 隨著電腦視覺領域的發展,如何有效地描述影像資訊成為重要的課題。而特徵提取就是其中一種被用來描述影像的方式,同時也被運用在許多任務中,像是物件偵測、物件追蹤、圖像拼接、三維視覺重建及定位等。在過去有許多不同的特徵提取方式被提出,像是基於圖像紋理、邊緣、形狀等基本特徵,並利用梯度或差值等數學關係描述的手工特徵提取方式;又或者是近期較熱門的基於深度學習的特徵描述方式,其利用多層卷積神經網路或注意力機制,有效地獲得局部及全局的影像資訊,並利用一定量的資料協助訓練,讓神經網路最佳化其結果。以上兩種皆是常見的特徵描述方式,而近期基於深度學習的方式在許多資料集上都有不錯的表現,所以我們基於現行資料集的基準提出新的特徵提取方式,並取得優於現有基準的結果。
首先我們使用的是稀疏特徵提取方式,並利用基於相機視角的弱監督學習方式來找到密集的對應關係。但因弱監督學習方式無法明確計算描述及檢測步驟的損失,因此導入基於相機視角的前處理,並將描述步驟與檢測步驟解耦訓練:先訓練描述網路以得到穩健且具判別力的描述子,再訓練檢測器,以取得更有鑑別力的特徵點位置。整體模型架構可分成以下三個部分。首先是基於相機視角的前處理,以利描述網路計算損失,可稱為線至窗口方法,主要利用兩相機內外參數及對極線轉換的概念,來協助優化描述網路的權重。再者是描述網路,這邊我們提出了新的架構,更專注於全局的視野感受,並導入了多感受域模塊以及基於注意力的全局資訊補償模塊,在這些模塊的作用下一定程度地提高了整個網路的描述能力。最後檢測網路部分則以較簡單的網路架構提取特徵點位置,並利用強化學習來優化網路;其中的改進是我們在訓練網路時加入了特徵點位置的對應關係。這邊我們利用兩影像的對應關係,其過程與特徵匹配中估計基礎矩陣的方式相同:利用隨機抽樣一致性算法計算出兩張照片特徵點間的基礎矩陣,再將估計矩陣與資料集提供的實際值之間的損失作為獎賞之一,並加上成對照片中特徵點的特徵相似度等資訊,完成最後的優化步驟。我們的模型對比於現行已提出的弱監督模型,能在廣泛使用的公共資料集上獲得可匹敵的結果,在照明改變影像的任務上甚至能獲得優於先前模型的結果。 | zh_TW
dc.description.abstract | With the development of computer vision, effectively describing image information has become an important issue. Feature extraction is one such way of describing images and is applied in many tasks, such as object detection, object tracking, image stitching, 3D reconstruction, and localization. Many different feature extraction methods have been proposed in the past, including handcrafted methods that describe basic features such as image texture, edges, and shapes through mathematical relationships such as gradients or differences, and, more recently, popular deep-learning-based feature descriptors, which employ multi-layer convolutional neural networks or attention mechanisms to capture local and global image information effectively and leverage a substantial amount of training data to optimize the network. Both are common ways of describing features, but deep-learning-based methods have recently shown excellent performance on many datasets. We therefore propose a new feature extraction method, evaluated against the benchmarks of existing datasets, that outperforms current baselines.
First, we employ a sparse feature extraction method and use weakly-supervised learning based on the camera perspective to find dense correspondences. However, because weakly supervised learning cannot explicitly compute the losses of the description and detection steps, we introduce camera-perspective-based preprocessing and decouple the training of the two steps into a two-stage approach: we first train the description step to obtain robust and discriminative descriptors, and then train the detector on those descriptors to obtain more discriminative feature point positions. The overall model architecture can be divided into three parts. First, the camera-perspective-based preprocessing, referred to as the line-to-window method, facilitates the calculation of the description network's loss; it mainly uses the intrinsic and extrinsic parameters of the two cameras and the concept of epipolar-line transformation to help optimize the weights of the description network. Second, the description network is given a new architecture that focuses on a global field of view, incorporating a multi-scale receptive field module and an attention-based global information compensation module, which enhance the network's descriptive capability to a certain extent. Finally, the detection network extracts feature point positions with a relatively simple architecture and is optimized with reinforcement learning. An improvement in this step is the inclusion of correspondences between feature point positions during training: we use the Random Sample Consensus (RANSAC) algorithm to estimate the fundamental matrix between two images, just as in fundamental-matrix estimation for feature matching, and take the loss between the estimated matrix and the ground truth provided by the dataset as one of the rewards, complemented by information such as the feature similarity of points extracted from paired images, to complete the final optimization step. Compared with existing weakly supervised models, ours achieves comparable results on widely used public datasets and even outperforms previously proposed models on illumination-changed image tasks. | en
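The two geometric ingredients in the abstract, mapping a keypoint to an epipolar line in the paired image (the starting point of the line-to-window search) and scoring the detector with a RANSAC-estimated fundamental matrix, follow standard multi-view geometry. Below is a minimal illustrative sketch, not the thesis code: the intrinsics `K`, the relative pose `R`, `t`, the window-sampling parameters, and the matrix-distance loss form are all placeholder assumptions.

```python
# Illustrative sketch of the epipolar-geometry steps described in the abstract.
import numpy as np
import cv2

def skew(v):
    """Skew-symmetric matrix [v]_x, so that skew(v) @ u == np.cross(v, u)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def fundamental_from_pose(K1, K2, R, t):
    """F = K2^{-T} [t]_x R K1^{-1}, from intrinsics and relative pose."""
    E = skew(t) @ R                                   # essential matrix
    return np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)

def epipolar_line(F, x1):
    """Epipolar line l' = F x1 in image 2 for pixel x1 = (u, v) in image 1."""
    l = F @ np.array([x1[0], x1[1], 1.0])
    return l / np.linalg.norm(l[:2])                  # scale so (a, b) is a unit normal

def window_centers_on_line(l, img_width, n=16, margin=8):
    """Sample n candidate window centers along a (non-vertical) line ax + by + c = 0;
    descriptors would then be compared inside a small window around each center."""
    a, b, c = l
    us = np.linspace(margin, img_width - margin, n)
    vs = -(a * us + c) / b
    return np.stack([us, vs], axis=1)

# Placeholder intrinsics and relative pose for the ground-truth matrix.
K = np.array([[500.0, 0.0, 240.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([1.0, 0.0, 0.0])
F_gt = fundamental_from_pose(K, K, R, t)
centers = window_centers_on_line(epipolar_line(F_gt, (100.0, 120.0)), img_width=480)

# RANSAC fundamental-matrix estimation from putative matches (random placeholders
# here); the estimate is then compared against the ground truth, e.g. with a
# normalized Frobenius distance, as one assumed form of the detector's reward.
pts1 = (np.random.rand(100, 2) * 480).astype(np.float32)
pts2 = (np.random.rand(100, 2) * 480).astype(np.float32)
F_est, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
if F_est is not None:
    F_est = F_est[:3]                                 # keep one 3x3 solution
    loss = np.linalg.norm(F_est / np.linalg.norm(F_est) - F_gt / np.linalg.norm(F_gt))
```

In the model itself, this matrix loss is only one reward term; per the abstract it is combined with descriptor-similarity information from the paired images before the detector is updated.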
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-01-26T16:32:08Z No. of bitstreams: 0 | en |
dc.description.provenance | Made available in DSpace on 2024-01-26T16:32:08Z (GMT). No. of bitstreams: 0 | en |
dc.description.tableofcontents | 誌謝 i
中文摘要 ii
ABSTRACT iv
CONTENTS vi
LIST OF FIGURES viii
LIST OF TABLES xi
Chapter 1 Introduction 1
1.1 Motivation 1
1.2 Primary Contributions 3
Chapter 2 Related Works 5
2.1 Weakly-Supervised Learning 6
2.2 Attention Mechanism 6
2.3 Multi-Scale Receptive Field 7
2.4 Camera Pose Estimation 8
2.5 Summary 8
Chapter 3 Proposed Description Network and Loss Design 10
3.1 Network Backbone 11
3.2 Balance Channel Ensemble 12
3.3 Attention Mechanism Module 13
3.4 Multi-Scale Receptive Field Module 15
3.5 The Overall Network Architecture 17
3.6 Network Loss 17
3.6.1 Line-to-Window Search 18
3.6.2 Loss Function 20
Chapter 4 Proposed Detection Network and Loss Design 22
4.1 Network Structure 22
4.2 Network Loss 23
Chapter 5 Network Training Details 27
5.1 Dataset 27
5.2 Training Pipeline 28
5.3 Description Network Optimization Pipeline 28
5.4 Detection Network Optimization Pipeline 29
Chapter 6 Experiments and Comparison 32
6.1 Experiments Details 32
6.2 Inference Result 33
6.3 Image Matching Task 36
6.4 Visual Localization Task 39
6.5 Ablation Study 42
Chapter 7 Conclusion and Future Work 45
REFERENCE 47 | - |
dc.language.iso | en | - |
dc.title | 基於相機視角與注意力機制的多尺寸感受域特徵提取器 | zh_TW |
dc.title | Multi-Scale Receptive Field Feature Extractor with Camera Pose and Attention Mechanism | en |
dc.type | Thesis | - |
dc.date.schoolyear | 112-1 | - |
dc.description.degree | Master's | - |
dc.contributor.oralexamcommittee | 簡鳳村;許文良;曾易聰 | zh_TW |
dc.contributor.oralexamcommittee | Feng-Tsun Chien;Wen-Liang Hsue;Yi-Chong Zeng | en |
dc.subject.keyword | 特徵提取,弱監督學習,相機視角,對極線,解耦訓練,多感受域,全局注意力,估計基礎矩陣,照明改變影像 | zh_TW |
dc.subject.keyword | Feature extraction, Weakly-supervised learning, Camera perspective, Epipolar line, Decoupled training, Multi-scale receptive field, Global attention, Fundamental matrix estimation, Illumination-changed image | en |
dc.relation.page | 50 | - |
dc.identifier.doi | 10.6342/NTU202400107 | - |
dc.rights.note | Authorized (open access worldwide) | - |
dc.date.accepted | 2024-01-22 | - |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
dc.contributor.author-dept | Graduate Institute of Communication Engineering | - |
Appears in Collections: | Graduate Institute of Communication Engineering
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-112-1.pdf | 3.06 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.