Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98529

| Title: | 優化挑戰性影像環境下的多模態檢索增強生成策略 Optimizing Multimodal Retrieval-Augmented Generation Strategies for Challenged Image Environments |
|---|---|
| Author: | 劉昱瑋 Yu-Wei Liu |
| Advisor: | 郭斯彥 Sy-Yen Kuo |
| Keywords: | Retrieval-Augmented Generation (RAG), Degraded Document Retrieval, Vision-Language Models (VLM), Feature Disentanglement, Information Retrieval, Robustness, Image Degradation |
| Publication Year: | 2025 |
| Degree: | Master's |
| Abstract: | Retrieval-Augmented Generation (RAG) has become a key technology for enhancing the performance of Large Language Models on knowledge-intensive tasks. However, even in an ideal scenario where a degraded document image containing the correct answer is directly provided, the generation performance of a 7B-parameter Vision-Language Model still exhibits a significant decline of approximately ten percent. This finding prompted us to seek a more integrated and fundamental solution. This thesis presents a systematic framework for degraded document retrieval. We adapt feature disentanglement methods, originally developed for processing degraded images, to the task of information retrieval. We design a lightweight Degraded Feature Module (DFM) that operates in parallel with a frozen, pre-trained vision encoder and explicitly disentangles the representation of a degraded image into a clean "content embedding" and a separate "degradation embedding." These two sets of embeddings are then fused and processed by a Large Language Model to generate a final, degradation-robust document representation. The contributions of this thesis are as follows: (1) We systematically constructed a comprehensive evaluation suite comprising 12 types of synthetic degradation and 5 types of real-world degradation. (2) Through extensive experiments, we quantified the performance degradation of current state-of-the-art (SOTA) RAG models under these degradations and identified "environmental noise" as the cause of severe performance drops for vision-based RAG models when processing real-world captured images. (3) We detail the design and implementation of a disentanglement-based retrieval framework aimed at solving this issue, along with its corresponding causality-inspired training strategy. (4) Applied to vision-based RAG models, our strategy modestly improves their performance while maintaining high efficiency. The findings of this study are of value to researchers and practitioners in information retrieval, digital humanities, and enterprise knowledge management. Our findings and framework offer a potential pathway for unlocking the valuable information stored in vast quantities of low-quality scans, historical archives, and user-uploaded photographs. (An illustrative sketch of the disentanglement pipeline described here follows this record table.) |
| URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98529 |
| DOI: | 10.6342/NTU202502697 |
| Full-Text Permission: | Not authorized |
| Electronic Full-Text Release Date: | N/A |
| Appears in Collections: | Graduate Institute of Electronics Engineering |
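
The abstract above outlines an architecture in which a lightweight Degraded Feature Module (DFM) runs alongside a frozen, pre-trained vision encoder, and its degradation embedding is fused with the clean content embedding before a Large Language Model produces the final document representation. The following is a minimal PyTorch sketch of one plausible reading of that pipeline, offered only as a reading aid: the layer choices, the gated-concatenation fusion, the token counts, and all class and parameter names are assumptions of this sketch, not the thesis's actual implementation.

```python
import torch
import torch.nn as nn


class DegradedFeatureModule(nn.Module):
    """Hypothetical lightweight module run in parallel with a frozen vision encoder.

    Maps a (possibly degraded) document image to a sequence of "degradation
    embedding" tokens. The real DFM design is not described in this record,
    so the layers below are purely illustrative.
    """

    def __init__(self, embed_dim: int = 1024):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=4, stride=4),
            nn.GELU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=4),
            nn.GELU(),
        )
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W) -> (B, num_patches, embed_dim)
        feats = self.stem(image)                    # (B, 128, h, w)
        feats = feats.flatten(2).transpose(1, 2)    # (B, h*w, 128)
        return self.proj(feats)


class DisentangledDocumentEncoder(nn.Module):
    """Fuses frozen content tokens with DFM degradation tokens.

    The fused tokens would then be consumed by the LLM that produces the
    degradation-robust document representation; the fusion scheme here
    (a learned projection over concatenated features) is an assumption.
    """

    def __init__(self, vision_encoder: nn.Module, dfm: DegradedFeatureModule,
                 embed_dim: int = 1024):
        super().__init__()
        self.vision_encoder = vision_encoder
        for p in self.vision_encoder.parameters():  # keep the encoder frozen
            p.requires_grad = False
        self.dfm = dfm
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        content = self.vision_encoder(image)        # (B, N, D) content tokens
        degradation = self.dfm(image)               # (B, M, D) degradation tokens
        # Summarize the degradation tokens and mix them into each content token.
        deg = degradation.mean(dim=1, keepdim=True).expand_as(content)
        fused = self.fuse(torch.cat([content, deg], dim=-1))
        return fused                                # handed to the LLM downstream


if __name__ == "__main__":
    # Stand-in "frozen vision encoder" for a shape smoke test only.
    class DummyEncoder(nn.Module):
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return torch.randn(x.shape[0], 196, 1024)

    model = DisentangledDocumentEncoder(DummyEncoder(), DegradedFeatureModule())
    out = model(torch.randn(2, 3, 224, 224))
    print(out.shape)  # torch.Size([2, 196, 1024])
```
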
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-113-2.pdf (restricted access) | 10.71 MB | Adobe PDF |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
