NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99099
Title: Binocular Stereo Vision Disparity Estimation Based on DINOv2 with Multi-Scale Attention Fusion (基於 DINOv2 與多尺度注意力融合的雙目立體視覺視差估計)
Author: Chang-Fu Hung (洪昌甫)
Advisor: Jian-Jiun Ding (丁建均)
Keywords: Stereo Matching, Disparity Estimation, DINOv2, Attention Fusion, Vision Transformer, Multi-Level Feature Integration, Autonomous Driving, Attention Mechanism
Publication Year: 2025
Degree: Master's
Abstract:
In autonomous driving applications, accurate binocular disparity estimation is one of the core technologies for achieving 3D scene understanding and obstacle recognition. To enhance disparity prediction performance in occluded and boundary regions, this study proposes a novel stereo vision architecture that integrates a frozen DINOv2 visual foundation model with a three-stage attention fusion module, enabling semantic consistency, cross-view correspondence, and structural preservation.
The architecture begins by extracting multi-level features from the 3rd, 6th, 9th, and 12th layers of DINOv2, where deeper features are downsampled to lower resolutions and shallower features retain higher spatial resolution, striking a balance between semantic representation and computational efficiency. These features are then processed through a three-stage fusion module: (1) the bottom-up semantic fusion stage gradually integrates low-resolution features upward, propagating global semantic information from the deep layers into the high-resolution representations and thereby enhancing semantic coherence and structural integrity; (2) the top-down detail refinement stage reinforces spatial details by transferring high-level semantic cues downward; (3) the cross-view attention stage performs left-right cross-attention between the features of the stereo pair at each scale, strengthening correspondence modeling and improving feature completion in occluded regions.
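The front-end described above can be illustrated with a brief PyTorch sketch. This is not the thesis implementation: the channel widths, pyramid scales, the exact wiring of the bottom-up and top-down passes, and the module names (FusionFrontEnd, CrossViewAttention) are illustrative assumptions; only the overall pattern (frozen multi-level ViT features, two resolution-wise fusion passes, then per-scale left-right cross-attention) follows the description.

```python
# Minimal sketch of the fusion front-end (illustrative only; channel widths,
# scales, and module names are assumptions, not the thesis implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossViewAttention(nn.Module):
    """Stage 3: left-right cross-attention at one pyramid scale."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat_l, feat_r):
        b, c, h, w = feat_l.shape
        tok_l = feat_l.flatten(2).transpose(1, 2)      # (B, HW, C)
        tok_r = feat_r.flatten(2).transpose(1, 2)
        out_l, _ = self.attn(tok_l, tok_r, tok_r)      # left queries attend to right view
        out_r, _ = self.attn(tok_r, tok_l, tok_l)      # right queries attend to left view
        out_l = out_l.transpose(1, 2).reshape(b, c, h, w)
        out_r = out_r.transpose(1, 2).reshape(b, c, h, w)
        return feat_l + out_l, feat_r + out_r          # residual fusion

class FusionFrontEnd(nn.Module):
    """Projects four frozen DINOv2 feature maps to a pyramid and fuses them."""
    def __init__(self, vit_dim=384, dim=128, scales=(1/4, 1/8, 1/16, 1/32)):
        super().__init__()
        self.scales = scales                           # shallowest layer -> 1/4, deepest -> 1/32
        self.proj = nn.ModuleList([nn.Conv2d(vit_dim, dim, 1) for _ in scales])
        self.bottom_up = nn.ModuleList([nn.Conv2d(dim, dim, 3, padding=1) for _ in scales[:-1]])
        self.top_down = nn.ModuleList([nn.Conv2d(dim, dim, 3, padding=1) for _ in scales[:-1]])
        self.cross = nn.ModuleList([CrossViewAttention(dim) for _ in scales])

    def pyramid(self, feats, image_hw):
        # Project each ViT feature map and resize it to its pyramid resolution.
        H, W = image_hw
        return [F.interpolate(p(f), size=(int(H * s), int(W * s)),
                              mode="bilinear", align_corners=False)
                for p, f, s in zip(self.proj, feats, self.scales)]

    def fuse(self, pyr):
        # Stage 1 (bottom-up): propagate deep, low-resolution semantics upward.
        for i in range(len(pyr) - 2, -1, -1):
            up = F.interpolate(pyr[i + 1], size=pyr[i].shape[-2:],
                               mode="bilinear", align_corners=False)
            pyr[i] = pyr[i] + self.bottom_up[i](up)
        # Stage 2 (top-down): push the refined high-resolution maps back down.
        for i in range(1, len(pyr)):
            down = F.interpolate(pyr[i - 1], size=pyr[i].shape[-2:],
                                 mode="bilinear", align_corners=False)
            pyr[i] = pyr[i] + self.top_down[i - 1](down)
        return pyr

    def forward(self, feats_l, feats_r, image_hw):
        # feats_l / feats_r: lists of (B, vit_dim, h, w) maps from DINOv2 layers 3, 6, 9, 12.
        pyr_l = self.fuse(self.pyramid(feats_l, image_hw))
        pyr_r = self.fuse(self.pyramid(feats_r, image_hw))
        # Stage 3: per-scale cross-view attention between the two views.
        return [cross(l, r) for cross, l, r in zip(self.cross, pyr_l, pyr_r)]
```

The frozen backbone features themselves could be obtained, for example, from the official DINOv2 torch.hub models via `get_intermediate_layers(x, n=(2, 5, 8, 11), reshape=True)` under `torch.no_grad()`; the exact layer hooks and patch handling used in the thesis are not specified here and are assumed.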
The fused feature tensor is subsequently fed into a geometry-guided disparity prediction module for multi-stage disparity regression and iterative refinement. Although the back-end disparity estimation follows the existing IGEV framework, the core contribution of this study lies in the design of the front-end fusion strategy, which successfully combines frozen semantic features with a modular three-stage attention mechanism. This design demonstrates strong occlusion handling capabilities and system scalability.
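The back-end itself is the published IGEV architecture, so no attempt is made to reproduce it here; the snippet below is only a generic schematic of the iterative-refinement idea that such geometry-guided methods share with RAFT-style pipelines (a soft-argmax initialization from a correlation volume followed by repeated residual updates). All names, sizes, and iteration counts are placeholders.

```python
# Generic schematic of cost-volume construction + iterative disparity refinement.
# NOT the IGEV implementation; max_disp, iters, and the update net are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

def correlation_volume(feat_l, feat_r, max_disp):
    """Per-pixel correlation between the left feature and the right feature shifted by d."""
    b, c, h, w = feat_l.shape
    cost = feat_l.new_zeros(b, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            cost[:, d] = (feat_l * feat_r).mean(1)
        else:
            cost[:, d, :, d:] = (feat_l[..., d:] * feat_r[..., :-d]).mean(1)
    return cost

class IterativeDisparityHead(nn.Module):
    def __init__(self, max_disp=48, iters=8, hidden=64):
        super().__init__()
        self.max_disp, self.iters = max_disp, iters
        self.update = nn.Sequential(                    # tiny residual-update network
            nn.Conv2d(max_disp + 1, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, 3, padding=1),
        )

    def forward(self, feat_l, feat_r):
        cost = correlation_volume(feat_l, feat_r, self.max_disp)           # (B, D, H, W)
        candidates = torch.arange(self.max_disp, device=cost.device,
                                  dtype=cost.dtype).view(1, -1, 1, 1)
        disp = (F.softmax(cost, dim=1) * candidates).sum(1, keepdim=True)  # soft-argmax init
        preds = []
        for _ in range(self.iters):
            disp = disp + self.update(torch.cat([cost, disp], dim=1))      # residual refinement
            preds.append(disp)
        return preds    # every iterate can be supervised, as is common in such methods
```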
A three-stage training pipeline is adopted: the model is pre-trained on SceneFlow, fine-tuned on Virtual KITTI, and then adapted to the real-world KITTI2012 and KITTI2015 datasets; for ETH3D, the model is fine-tuned directly from the SceneFlow pretraining. On the KITTI2012 test set, the model achieves an Out-Noc error rate of 7.45% and an average disparity error of 1.2 pixels under the 3-pixel criterion. On the KITTI2015 test set, it attains a D1-all error of 4.10% over all pixels and a D1-fg foreground error of 6.70%, demonstrating robust matching on dynamic vehicles and in occluded regions. The experimental results show that the proposed method offers superior visual consistency and structural preservation in occlusion reasoning, highlighting its potential for deployment in autonomous driving scenarios.
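For reference, the quoted numbers correspond to the standard KITTI benchmark definitions: the 3-pixel outlier rate over non-occluded pixels (Out-Noc, KITTI2012), the mean absolute disparity error, and the D1 rate (KITTI2015: a pixel counts as an outlier when its error exceeds both 3 px and 5% of the ground-truth disparity). A small reference sketch, with placeholder array names:

```python
# Reference sketch of KITTI-style disparity metrics (array names are placeholders).
import numpy as np

def kitti_metrics(pred, gt, valid_mask, tau_px=3.0, tau_rel=0.05):
    """pred, gt: (H, W) disparity maps; valid_mask: boolean mask of evaluated pixels
    (non-occluded pixels for Out-Noc, all labeled pixels for D1-all, foreground for D1-fg)."""
    err = np.abs(pred - gt)[valid_mask]
    gt_v = gt[valid_mask]
    avg_err = err.mean()                                    # mean absolute disparity error (px)
    out_rate = (err > tau_px).mean()                        # KITTI2012 outlier rate at 3 px
    d1 = ((err > tau_px) & (err > tau_rel * gt_v)).mean()   # KITTI2015 D1 outlier definition
    return avg_err, out_rate, d1
```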
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99099
DOI: 10.6342/NTU202503655
Full-Text License: Not authorized for public access
Electronic Full-Text Release Date: N/A
Appears in Collections: Graduate Institute of Communication Engineering

Files in This Item:
File: ntu-113-2.pdf (restricted access; not authorized for public release)
Size: 22.96 MB
Format: Adobe PDF


Unless their copyright terms state otherwise, all items in this repository are protected by copyright, with all rights reserved.
