Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99099

Full metadata record
dc.contributor.advisor: Jian-Jiun Ding (丁建均)
dc.contributor.author: Chang-Fu Hung (洪昌甫)
dc.date.accessioned: 2025-08-21T16:22:59Z
dc.date.available: 2025-08-22
dc.date.copyright: 2025-08-21
dc.date.issued: 2025
dc.date.submitted: 2025-08-04
dc.identifier.citation:
Scharstein, D., & Szeliski, R. (2002). A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47, 7-42.
Hirschmuller, H. (2007). Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2), 328-341.
Geiger, A., Roser, M., & Urtasun, R. (2010, November). Efficient large-scale stereo matching. In Asian conference on computer vision (pp. 25-38). Berlin, Heidelberg: Springer Berlin Heidelberg.
Kendall, A., Martirosyan, H., Dasgupta, S., Henry, P., Kennedy, R., Bachrach, A., & Bry, A. (2017). End-to-end learning of geometry and context for deep stereo regression. In Proceedings of the IEEE international conference on computer vision (pp. 66-75).
Chang, J. R., & Chen, Y. S. (2018). Pyramid stereo matching network. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5410-5418).
Guo, X., Yang, K., Yang, W., Wang, X., & Li, H. (2019). Group-wise correlation stereo network. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 3273-3282).
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., ... & Bojanowski, P. (2023). DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193.
Xu, G., Wang, X., Ding, X., & Yang, X. (2023). Iterative geometry encoding volume for stereo matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 21919-21928).
Li, J., Wang, P., Xiong, P., Cai, T., Yan, Z., Yang, L., ... & Liu, S. (2022). Practical stereo matching via cascaded recurrent network with adaptive correlation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 16263-16272).
Weinzaepfel, P., Lucas, T., Leroy, V., Cabon, Y., Arora, V., Brégier, R., ... & Revaud, J. (2023). CroCo v2: Improved cross-view completion pre-training for stereo matching and optical flow. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 17969-17980).
Wang, W., Xie, E., Li, X., Fan, D. P., Song, K., Liang, D., ... & Shao, L. (2021). Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 568-578).
Hu, H., Cui, J., & Wang, L. (2021). Region-aware contrastive learning for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 16291-16301).
Xu, G., Cheng, J., Guo, P., & Yang, X. (2022). Attention concatenation volume for accurate and efficient stereo matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 12981-12990).
Zabih, R., & Woodfill, J. (1994). Non-parametric local transforms for computing visual correspondence. In Computer Vision—ECCV'94: Third European Conference on Computer Vision Stockholm, Sweden, May 2–6 1994 Proceedings, Volume II 3 (pp. 151-158). Springer Berlin Heidelberg.
Hirschmuller, H., & Scharstein, D. (2007, June). Evaluation of cost functions for stereo matching. In 2007 IEEE conference on computer vision and pattern recognition (pp. 1-8). IEEE.
Bradski, G. (2000). The OpenCV library. Dr. Dobb's Journal: Software Tools for the Professional Programmer, 25(11), 120-123.
Kolmogorov, V., & Zabih, R. (2001, July). Computing visual correspondence with occlusions using graph cuts. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001 (Vol. 2, pp. 508-515). IEEE.
Yang, Q., Wang, L., Yang, R., Wang, S., Liao, M., & Nister, D. (2006, September). Real-time global stereo matching using hierarchical belief propagation. In BMVC (Vol. 6, pp. 989-998).
Hirschmuller, H. (2005, June). Accurate and efficient stereo processing by semi-global matching and mutual information. In 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR'05) (Vol. 2, pp. 807-814). IEEE.
Cheng, X., Zhong, Y., Harandi, M., Dai, Y., Chang, X., Li, H., ... & Ge, Z. (2020). Hierarchical neural architecture search for deep stereo matching. Advances in Neural Information Processing Systems, 33, 22158-22169.
Zhang, F., Prisacariu, V., Yang, R., & Torr, P. H. (2019). GA-Net: Guided aggregation net for end-to-end stereo matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 185-194).
Lipson, L., Teed, Z., & Deng, J. (2021, December). RAFT-Stereo: Multilevel recurrent field transforms for stereo matching. In 2021 International Conference on 3D Vision (3DV) (pp. 218-227). IEEE.
Shen, Z., Dai, Y., & Rao, Z. (2021). CFNet: Cascade and fused cost volume for robust stereo matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 13906-13915).
Li, Z., Liu, X., Drenkow, N., Ding, A., Creighton, F. X., Taylor, R. H., & Unberath, M. (2021). Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 6197-6206).
Liu, C. W., Chen, Q., & Fan, R. (2024). Playing to Vision Foundation Model's Strengths in Stereo Matching. IEEE Transactions on Intelligent Vehicles.
Lin, T. Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2117-2125).
Rao, Z., Dai, Y., Shen, Z., & He, R. (2022). Rethinking training strategy in stereo matching. IEEE Transactions on Neural Networks and Learning Systems, 34(10), 7796-7809.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., ... & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 10012-10022).
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021, July). Training data-efficient image transformers & distillation through attention. In International conference on machine learning (pp. 10347-10357). PMLR.
Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7132-7141).
Mayer, N., Ilg, E., Hausser, P., Fischer, P., Cremers, D., Dosovitskiy, A., & Brox, T. (2016). A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4040-4048).
Gaidon, A., Wang, Q., Cabon, Y., & Vig, E. (2016). Virtual worlds as proxy for multi-object tracking analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4340-4349).
Geiger, A., Lenz, P., & Urtasun, R. (2012, June). Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE conference on computer vision and pattern recognition (pp. 3354-3361). IEEE.
Menze, M., & Geiger, A. (2015). Object scene flow for autonomous vehicles. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3061-3070).
Schops, T., Schonberger, J. L., Galliani, S., Sattler, T., Schindler, K., Pollefeys, M., & Geiger, A. (2017). A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3260-3269).
Scharstein, D., Hirschmüller, H., Kitajima, Y., Krathwohl, G., Nešić, N., Wang, X., & Westling, P. (2014, September). High-resolution stereo datasets with subpixel-accurate ground truth. In German conference on pattern recognition (pp. 31-42). Cham: Springer International Publishing.
Zhang, Y., Chen, Y., Bai, X., Yu, S., Yu, K., Li, Z., & Yang, K. (2020, April). Adaptive unimodal cost volume filtering for deep stereo matching. In Proceedings of the AAAI conference on artificial intelligence (Vol. 34, No. 07, pp. 12926-12934).
Shen, Z., Dai, Y., Song, X., Rao, Z., Zhou, D., & Zhang, L. (2022, October). PCW-Net: Pyramid combination and warping cost volume for stereo matching. In European conference on computer vision (pp. 280-297). Cham: Springer Nature Switzerland.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99099
dc.description.abstract:
In autonomous driving applications, accurate binocular disparity estimation is one of the core technologies for achieving 3D scene understanding and obstacle recognition. To enhance disparity prediction performance in occluded and boundary regions, this study proposes a novel stereo vision architecture that integrates a frozen DINOv2 visual foundation model with a three-stage attention fusion module, enabling semantic consistency, cross-view correspondence, and structural preservation.
The architecture begins by extracting multi-level features from the 3rd, 6th, 9th, and 12th layers of DINOv2, where deeper features are downsampled to lower resolutions and shallower features retain higher spatial resolution, striking a balance between semantic representation and computational efficiency. These features are then processed through a three-stage fusion module: (1) the bottom-to-up semantic fusion stage gradually integrates low-resolution features upward, propagating global semantic information from deep layers into high-resolution representations, thereby enhancing semantic coherence and structural integrity; (2) the up-to-bottom detail refinement stage reinforces spatial details by transferring high-level semantic cues downward; (3) the cross-view attention stage performs left-right cross-attention between features from the stereo pair at each scale, strengthening correspondence modeling and improving feature completion in occluded regions.
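As a rough illustration of the front end described above, the following minimal PyTorch sketch extracts features from DINOv2 blocks 3, 6, 9, and 12 via the public torch.hub release, builds a multi-scale pyramid, and applies left-right cross-attention at one scale. The scale schedule, module names, and residual fusion are illustrative assumptions, not the thesis implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Frozen DINOv2 backbone (ViT-S/14) from the public torch.hub release.
    # Input height and width must be multiples of the 14-pixel patch size.
    backbone = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad = False  # the backbone stays frozen during training

    def pyramid_features(img):
        # Blocks 3, 6, 9, 12 correspond to indices 2, 5, 8, 11.
        # reshape=True returns (B, C, H/14, W/14) feature maps.
        feats = backbone.get_intermediate_layers(img, n=[2, 5, 8, 11], reshape=True)
        # Hypothetical scale schedule: shallow blocks keep their resolution,
        # deeper blocks are downsampled further, as the abstract describes.
        scales = [1.0, 0.5, 0.25, 0.125]
        return [f if s == 1.0 else
                F.interpolate(f, scale_factor=s, mode='bilinear', align_corners=False)
                for f, s in zip(feats, scales)]

    class CrossViewAttention(nn.Module):
        # Stage 3 of the fusion: left features attend to right features
        # at one pyramid scale, with a residual connection.
        def __init__(self, dim, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, left, right):
            b, c, h, w = left.shape
            q = left.flatten(2).transpose(1, 2)    # (B, HW, C) queries: left view
            kv = right.flatten(2).transpose(1, 2)  # keys/values: right view
            out, _ = self.attn(q, kv, kv)
            return (q + out).transpose(1, 2).reshape(b, c, h, w)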
The fused feature tensor is subsequently fed into a geometry-guided disparity prediction module for multi-stage disparity regression and iterative refinement. Although the back-end disparity estimation follows the existing IGEV framework, the core contribution of this study lies in the design of the front-end fusion strategy, which successfully combines frozen semantic features with a modular three-stage attention mechanism. This design demonstrates strong occlusion handling capabilities and system scalability.
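The geometry-guided regression step can be illustrated by the standard soft-argmin over a matching-cost volume (Kendall et al., 2017, cited above), which IGEV-style networks use to obtain differentiable, sub-pixel disparity estimates. This is a generic sketch of that technique, not the thesis code.

    import torch
    import torch.nn.functional as F

    def soft_argmin_disparity(cost_volume):
        # cost_volume: (B, D, H, W) matching costs over D candidate disparities.
        # Low cost becomes high probability; the expectation over candidates
        # yields a sub-pixel disparity map of shape (B, H, W).
        prob = F.softmax(-cost_volume, dim=1)
        disp = torch.arange(cost_volume.size(1), device=cost_volume.device,
                            dtype=prob.dtype).view(1, -1, 1, 1)
        return (prob * disp).sum(dim=1)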
A three-stage training pipeline is adopted: the model is pre-trained on SceneFlow, fine-tuned on Virtual KITTI, and then adapted to the real-world KITTI 2012 and KITTI 2015 datasets; for ETH3D, the model is fine-tuned directly from the SceneFlow pretraining. On the KITTI 2012 test set, the model achieves an Out-Noc error rate of 7.45% and an average disparity error of 1.2 pixels under the 3-pixel threshold. On the KITTI 2015 test set, the model attains a D1-all error of 4.10% and a D1-fg foreground error of 6.70%, demonstrating robust matching performance in dynamic and occluded regions. Experimental results show that the proposed method achieves superior visual consistency and structural preservation in occlusion reasoning, highlighting its potential for deployment in autonomous driving scenarios.
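For reference, a small sketch of how the quoted KITTI-style metrics are commonly computed; definitions follow the public benchmark conventions (Out-Noc counts non-occluded pixels whose error exceeds 3 px; D1 counts pixels whose error exceeds both 3 px and 5% of the ground truth), and the function and argument names are illustrative.

    import torch

    def stereo_metrics(pred, gt, valid):
        # pred, gt: (H, W) disparity maps; valid: boolean mask of evaluated
        # pixels (e.g., non-occluded ground-truth pixels for Out-Noc).
        err = (pred - gt).abs()[valid]
        gt_v = gt[valid]
        epe = err.mean().item()                   # average disparity error (px)
        bad3 = (err > 3.0).float().mean().item()  # Out-Noc-style 3-px error rate
        d1 = ((err > 3.0) & (err > 0.05 * gt_v)).float().mean().item()  # KITTI D1
        return epe, 100.0 * bad3, 100.0 * d1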
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-08-21T16:22:59Z. No. of bitstreams: 0
dc.description.provenance: Made available in DSpace on 2025-08-21T16:22:59Z (GMT). No. of bitstreams: 0
dc.description.tableofcontents:
Acknowledgements i
Chinese Abstract ii
ABSTRACT iv
CONTENTS vi
LIST OF FIGURES ix
LIST OF TABLES xv
Chapter 1 Introduction 1
1.1 Research Background 1
1.2 Motivation 3
1.3 Research Contributions 5
Chapter 2 Related Works 8
2.1 Traditional Methods 9
2.1.1 Local Methods 10
2.1.2 Global Methods 10
2.1.3 Semi-Global Methods 11
2.1.4 Summary and Discussion 12
2.2 CNNs-based Methods 12
2.2.1 Basic CNN Architectures 13
2.2.2 Advanced CNN Architectures 14
2.2.3 Summary and Discussion 15
2.3 Geometry-Guided Methods 16
2.3.1 IGEV: Iterative Geometry Encoding Volume for Stereo Matching 17
2.3.2 Other Representative Geometry-Guided Methods 18
2.3.3 Summary and Discussion 19
2.4 Transformer-based Methods 20
2.4.1 Representative Transformer-based Methods 20
2.4.2 Summary and Discussion 21
Chapter 3 Proposed Method 23
3.1 Data Preprocessing and Augmentation 30
3.1.1 Preprocessing Strategy 30
3.1.2 Augmentation Strategy 31
3.1.3 Differences Between Training and Inference 32
3.2 Multi-Scale ViT Feature Fusion Module 33
3.2.1 ViT Backbone Selection and Feature Extraction 34
3.2.2 Multi-Scale Module 37
3.2.3 Feature Fusion Mechanisms 40
3.3 Iterative Geometry Encoding Volume (IGEV) 50
3.3.1 Overview and Motivation 50
3.3.2 Integration and Implementation 51
Chapter 4 Experiments 52
4.1 Experimental Settings 52
4.1.1 Datasets 52
4.1.2 Evaluation Metrics 55
4.1.3 Implementation Details 57
4.2 Fine-Tuning Strategy 59
4.2.1 Module Freezing Strategy at Each Stage 59
4.2.2 KITTI 2012 Analysis 60
4.2.3 KITTI 2015 Analysis 61
4.2.4 ETH3D Analysis 63
4.2.5 Middlebury Analysis 64
4.2.6 Summary 65
4.3 Experimental Results 65
4.3.1 Results on Different Datasets 65
4.3.2 Comparison with Other Methods 70
Chapter 5 Conclusion 77
5.1 Technical Summary 77
5.2 Limitations and Directions for Improvement 78
5.3 Future Work 79
REFERENCE 82
dc.language.iso: en
dc.subject: Attention Mechanism
dc.subject: Autonomous Driving
dc.subject: Multi-Level Feature Integration
dc.subject: Vision Transformer
dc.subject: Attention Fusion
dc.subject: DINOv2
dc.subject: Disparity Estimation
dc.subject: Stereo Matching
dc.title: Binocular Stereo Vision Disparity Estimation Based on DINOv2 with Multi-Scale Attention Fusion (基於 DINOv2 與多尺度注意力融合的雙目立體視覺視差估計)
dc.type: Thesis
dc.date.schoolyear: 113-2
dc.description.degree: Master's (碩士)
dc.contributor.oralexamcommittee: Wen-Liang Hsue (許文良); Chih-Chang Yu (余執彰)
dc.subject.keyword: Stereo Matching, Disparity Estimation, DINOv2, Attention Fusion, Vision Transformer, Multi-Level Feature Integration, Autonomous Driving, Attention Mechanism
dc.relation.page: 87
dc.identifier.doi: 10.6342/NTU202503655
dc.rights.note: Not authorized (未授權)
dc.date.accepted: 2025-08-08
dc.contributor.author-college: College of Electrical Engineering and Computer Science (電機資訊學院)
dc.contributor.author-dept: Graduate Institute of Communication Engineering (電信工程學研究所)
dc.date.embargo-lift: N/A
Appears in collections: Graduate Institute of Communication Engineering (電信工程學研究所)

Files in this item:
ntu-113-2.pdf (22.96 MB, Adobe PDF; restricted, not authorized for public access)