Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98529

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 郭斯彥 | zh_TW |
| dc.contributor.advisor | Sy-Yen Kuo | en |
| dc.contributor.author | 劉昱瑋 | zh_TW |
| dc.contributor.author | Yu-Wei Liu | en |
| dc.date.accessioned | 2025-08-14T16:28:10Z | - |
| dc.date.available | 2025-08-15 | - |
| dc.date.copyright | 2025-08-14 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-07-30 | - |
| dc.identifier.citation | T. Blau, S. Fogel, R. Ronen, A. Golts, R. Ganz, E. Ben Avraham, A. Aberdam, S. Tsiper, and R. Litman. GRAM: Global reasoning for multi-page VQA. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15598–15607, 2024.
J. H. Caufield, H. Hegde, V. Emonet, N. L. Harris, M. P. Joachimiak, N. Matentzoglu, H. Kim, S. Moxon, J. T. Reese, M. A. Haendel, et al. Structured prompt interrogation and recursive extraction of semantics (SPIRES): A method for populating knowledge bases using zero-shot learning. Bioinformatics, 40(3):btae104, 2024.
I. Chen, W.-T. Chen, Y.-W. Liu, Y.-C. Chiang, S.-Y. Kuo, M.-H. Yang, et al. UniRestore: Unified perceptual and task-oriented image restoration model using diffusion prior. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17969–17979, 2025.
Y. Chen, J. Zhang, K. Peng, J. Zheng, R. Liu, P. Torr, and R. Stiefelhagen. RoDLA: Benchmarking the robustness of document layout analysis models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15556–15566, 2024.
L. Deng, Y. Sun, S. Chen, N. Yang, Y. Wang, and R. Song. MuKA: Multimodal knowledge augmented visual information-seeking. In Proceedings of the 31st International Conference on Computational Linguistics, pages 9675–9686, 2025.
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
M. Faysse, H. Sibille, T. Wu, B. Omrani, G. Viaud, C. Hudelot, and P. Colombo. ColPali: Efficient document retrieval with vision language models. In The Thirteenth International Conference on Learning Representations, 2024.
A. W. Harley, A. Ufkes, and K. G. Derpanis. Evaluation of deep convolutional nets for document image classification and retrieval. In International Conference on Document Analysis and Recognition (ICDAR), 2015.
L. Kang, R. Tito, E. Valveny, and D. Karatzas. Multi-page document visual question answering using self-attention scoring mechanism. In International Conference on Document Analysis and Recognition, pages 219–232. Springer, 2024.
O. Khattab and M. Zaharia. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 39–48, 2020.
C. Lee, R. Roy, M. Xu, J. Raiman, M. Shoeybi, B. Catanzaro, and W. Ping. NV-Embed: Improved techniques for training LLMs as generalist embedding models. arXiv preprint arXiv:2405.17428, 2024.
P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
J. Li, D. Li, C. Xiong, and S. Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
L. Li, Y. Wang, R. Xu, P. Wang, X. Feng, L. Kong, and Q. Liu. Multimodal ArXiv: A dataset for improving scientific comprehension of large vision-language models. arXiv preprint arXiv:2403.00231, 2024.
Z. Luo, F. K. Gustafsson, Z. Zhao, J. Sjölund, and T. B. Schön. Controlling vision-language models for multi-task image restoration. arXiv preprint arXiv:2310.01018, 2023.
A. Mansurova, A. Mansurova, and A. Nugumanova. QA-RAG: Exploring LLM reliance on external knowledge. Big Data and Cognitive Computing, 8(9):115, 2024.
A. Masry, D. X. Long, J. Q. Tan, S. Joty, and E. Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022.
M. Mathew, V. Bagal, R. Tito, D. Karatzas, E. Valveny, and C. Jawahar. InfographicVQA. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1697–1706, 2022.
N. Methani, P. Ganguly, M. M. Khapra, and P. Kumar. PlotQA: Reasoning over scientific plots. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1527–1536, 2020.
V. Potlapalli, S. W. Zamir, S. H. Khan, and F. Shahbaz Khan. PromptIR: Prompting for all-in-one image restoration. Advances in Neural Information Processing Systems, 36:71275–71293, 2023.
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
S. Robertson, H. Zaragoza, et al. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389, 2009.
T. Son, J. Kang, N. Kim, S. Cho, and S. Kwak. URIE: Universal image enhancement for visual recognition in the wild. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX, pages 749–765. Springer, 2020.
A. Sulaiman, K. Omar, and M. F. Nasrudin. Degraded historical document binarization: A review on issues, challenges, techniques, and future directions. Journal of Imaging, 5(4):48, 2019.
R. Tanaka, K. Nishida, K. Nishida, T. Hasegawa, I. Saito, and K. Saito. SlideVQA: A dataset for document visual question answering on multiple images. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 13636–13645, 2023.
R. Tito, D. Karatzas, and E. Valveny. Hierarchical multimodal transformers for multi-page DocVQA. Pattern Recognition, 144:109834, 2023.
J. Van Landeghem, R. Tito, Ł. Borchmann, M. Pietruszka, P. Joziak, R. Powalski, D. Jurkiewicz, M. Coustaty, B. Anckaert, E. Valveny, et al. Document Understanding Dataset and Evaluation (DUDE). In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19528–19540, 2023.
Z. Xu, S. Jain, and M. Kankanhalli. Hallucination is inevitable: An innate limitation of large language models. arXiv preprint arXiv:2401.11817, 2024.
Y. Yao, T. Yu, A. Zhang, C. Wang, J. Cui, H. Zhu, T. Cai, H. Li, W. Zhao, Z. He, et al. MiniCPM-V: A GPT-4V level MLLM on your phone. arXiv preprint arXiv:2408.01800, 2024.
F. Ye, S. Li, Y. Zhang, and L. Chen. R^2AG: Incorporating retrieval information into retrieval augmented generation. arXiv preprint arXiv:2406.13249, 2024.
S. Yu, C. Tang, B. Xu, J. Cui, J. Ran, Y. Yan, Z. Liu, S. Wang, X. Han, Z. Liu, et al. VisRAG: Vision-based retrieval-augmented generation on multi-modality documents. arXiv preprint arXiv:2410.10594, 2024.
Z. Yu, C. Xiong, S. Yu, and Z. Liu. Augmentation-adapted retriever improves generalization of language models as generic plug-in. arXiv preprint arXiv:2305.17331, 2023.
S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, and M.-H. Yang. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5728–5739, 2022.
X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
J. Zhang, Q. Zhang, B. Wang, L. Ouyang, Z. Wen, Y. Li, K.-H. Chow, C. He, and W. Zhang. OCR hinders RAG: Evaluating the cascading impact of OCR on retrieval-augmented generation. arXiv preprint arXiv:2412.02592, 2024.
P. Zhang, S. Xiao, Z. Liu, Z. Dou, and J.-Y. Nie. Retrieve anything to augment large language models. arXiv preprint arXiv:2310.07554, 2023.
Q. Zhang, V. S.-J. Huang, B. Wang, J. Zhang, Z. Wang, H. Liang, S. Wang, M. Lin, C. He, and W. Zhang. Document parsing unveiled: Techniques, challenges, and prospects for structured information extraction. arXiv preprint arXiv:2410.21169, 2024. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98529 | - |
| dc.description.abstract | 檢索增強生成(Retrieval-Augmented Generation, RAG)已是提升大型語言模型處理知識密集型任務的關鍵技術。然而,即便在直接提供包含正確答案的降質文件影像的理想情境下,一個 7B 參數級的視覺語言模型在生成答案時的性能,依然會出現約一成的顯著滑落,這促使我們尋求一個更整合、更根本的解決方案。本論文針對降質文件檢索提出一個系統性的解決框架。我們使用處理降質影像所發展的特徵解耦(feature disentanglement)方法,並將其應用於資訊檢索任務。我們設計了一個輕量化的「降質特徵模組 (DFM)」,使其與一個固定的預訓練視覺編碼器並行運作,旨在將降質影像的表徵顯式地解耦為一個純淨的「內容嵌入」和一個獨立的「降質嵌入」。這兩組嵌入隨後被融合並由大型語言模型處理,以生成一個對降質穩健的最終文件表徵。本論文的貢獻如下:(1) 我們系統性地構建了一套包含 12 種合成降質以及 5 種真實世界降質的全面評估資料集。(2) 透過詳盡的實驗,我們量化了當前最先進的 RAG 模型在面對各類降質時的性能衰退,並指認出視覺導向的 RAG 模型在處理真實拍攝影像時,因「環境噪聲」而導致性能嚴重下降的現象。(3) 我們詳細闡述了一個旨在解決此問題的解耦式檢索框架的設計與實作,及其對應的訓練策略。(4) 我們的策略運用在視覺導向的 RAG 模型上,能微幅地提升其性能並保持高效率。本研究的成果對於資訊檢索、數位人文、企業知識管理等領域的研究者與實務工作者具有參考價值。我們的發現與框架,為解鎖存儲在大量低品質掃描件、歷史檔案與使用者上傳照片中寶貴資訊的潛力,提供了可能的途徑。 | zh_TW |
| dc.description.abstract | Retrieval-Augmented Generation (RAG) has become a key technology for enhancing the performance of Large Language Models on knowledge-intensive tasks. However, even in an ideal scenario where a degraded document image containing the correct answer is directly provided, the generation performance of a 7B-parameter Vision-Language Model still exhibits a significant decline of approximately ten percent. This finding prompted us to seek a more integrated and fundamental solution.
This thesis presents a systematic framework for degraded document retrieval. We adapt feature disentanglement methods, developed for processing degraded images, to the task of information retrieval. We designed a lightweight Degraded Feature Module (DFM) that operates in parallel with a frozen, pre-trained vision encoder, with the goal of explicitly disentangling the representation of a degraded image into a clean "content embedding" and a separate "degradation embedding." These two sets of embeddings are then fused and processed by a Large Language Model to generate a final, degradation-robust document representation (see the illustrative sketch after the metadata table below). The contributions of this thesis are as follows: (1) We systematically constructed a comprehensive evaluation suite comprising 12 types of synthetic degradations and 5 types of real-world degradations. (2) Through exhaustive experiments, we quantified the performance degradation of current state-of-the-art (SOTA) RAG models when faced with various degradations, and we identified the phenomenon of "environmental noise" causing severe performance drops for vision-based models when processing real-world captured images. (3) We elaborated on the design and implementation of a disentanglement-based retrieval framework aimed at solving this issue, and its corresponding causality-inspired training strategy. (4) Our strategy, when applied to vision-based models, modestly improves their performance while maintaining high efficiency. The findings of this study are of significant value to researchers and practitioners in the fields of information retrieval, digital humanities, and enterprise knowledge management. Our findings and framework offer a potential pathway for unlocking the valuable information stored in vast quantities of low-quality scans, historical archives, and user-uploaded photographs. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-08-14T16:28:10Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-08-14T16:28:10Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Acknowledgements i
摘要 iii
Abstract v
Contents vii
List of Figures x
List of Tables xi
Chapter 1 Introduction 1
1.1 Limitations of Large Language Models and the Rise of Retrieval-Augmented Generation 1
1.2 Challenges of RAG for Document Understanding 3
1.3 Document Image Quality Degradation: The Core Obstacle 4
1.4 Research Motivation and Problem Statement 5
1.5 Proposed Method and Contributions 6
1.6 Application Value and Thesis Structure 8
Chapter 2 Related Work 9
2.1 Text-based RAG 9
2.2 Vision-based RAG 11
2.2.1 Foundational Vision-Language Models 11
2.2.2 Vision-based Retrieval Systems 12
2.3 Image Restoration 13
2.4 Application of Image Restoration to Retrieval Tasks 14
Chapter 3 Methodology 16
3.1 Proposed Framework: An Adaptive Disentanglement-based Architecture 16
3.1.1 Architecture Adaptation 17
3.1.2 Disentangled Feature Generation 17
3.2 Two-Stage Training Strategy 18
3.2.1 Stage 1: Training the Degraded Feature Module 18
3.2.2 Stage 2: Fine-tuning the Language Model for Retrieval 20
3.3 Dataset Construction 22
3.3.1 Synthetic Degradation Dataset 22
3.3.2 Real-World Degradation Dataset 25
Chapter 4 Experiments 28
4.1 Experimental Setup 28
4.1.1 Datasets 29
4.1.1.1 Base Datasets for Retrieval 29
4.1.1.2 Degraded Datasets 30
4.1.2 Evaluation Metrics 31
4.1.3 Baselines and Models 31
4.2 Results and Analysis 33
4.2.1 Performance on Clean Document Datasets (d3, d7) 34
4.2.2 Performance on Synthetically Degraded Datasets (d6) 35
4.2.3 Performance on Real-World Degraded Datasets (d8, d9) 36
Chapter 5 Conclusion 38
5.1 Discussion 38
5.2 Future Work 40
References 44 | - |
| dc.language.iso | en | - |
| dc.subject | 檢索增強生成 (RAG) | zh_TW |
| dc.subject | 降質文件檢索 | zh_TW |
| dc.subject | 視覺語言模型 (VLM) | zh_TW |
| dc.subject | 特徵解耦 | zh_TW |
| dc.subject | 資訊檢索 | zh_TW |
| dc.subject | 穩健性 | zh_TW |
| dc.subject | 影像衰退 | zh_TW |
| dc.subject | Degraded Document Retrieval | en |
| dc.subject | Image Degradation | en |
| dc.subject | Robustness | en |
| dc.subject | Information Retrieval | en |
| dc.subject | Feature Disentanglement | en |
| dc.subject | Vision-Language Models (VLM) | en |
| dc.subject | Retrieval-Augmented Generation (RAG) | en |
| dc.title | 優化挑戰性影像環境下的多模態檢索增強生成策略 | zh_TW |
| dc.title | Optimizing Multimodal Retrieval-Augmented Generation Strategies for Challenging Image Environments | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-2 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.oralexamcommittee | 張耀文;雷欽隆;顏嗣鈞;陳俊良 | zh_TW |
| dc.contributor.oralexamcommittee | Yao-Wen Chang;Chin-Laung Lei;Hsu-Chun Yen;Jiann-Liang Chen | en |
| dc.subject.keyword | 檢索增強生成 (RAG),降質文件檢索,視覺語言模型 (VLM),特徵解耦,資訊檢索,穩健性,影像衰退 | zh_TW |
| dc.subject.keyword | Retrieval-Augmented Generation (RAG), Degraded Document Retrieval, Vision-Language Models (VLM), Feature Disentanglement, Information Retrieval, Robustness, Image Degradation | en |
| dc.relation.page | 49 | - |
| dc.identifier.doi | 10.6342/NTU202502697 | - |
| dc.rights.note | 未授權 | - |
| dc.date.accepted | 2025-07-31 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 電子工程學研究所 | - |
| dc.date.embargo-lift | N/A | - |
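As a companion to the abstract above, the following is a minimal sketch of the architecture it describes: a lightweight Degraded Feature Module (DFM) running in parallel with a frozen, pre-trained vision encoder, producing a "degradation embedding" that is fused with a pooled "content embedding" before downstream processing by a language model. This is an illustration under stated assumptions only, written in PyTorch; the class names, dimensions, mean-pooling content head, and concatenation-based fusion are hypothetical stand-ins, not the thesis's actual implementation.

```python
# Illustrative sketch only. Assumes PyTorch and a ViT-style encoder that
# returns patch features of shape (B, N, dim); all dimensions and the
# concat-based fusion are hypothetical, not taken from the thesis.
import torch
import torch.nn as nn


class DegradedFeatureModule(nn.Module):
    """Lightweight head mapping patch features to a compact degradation embedding."""

    def __init__(self, in_dim: int = 768, deg_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.GELU(), nn.Linear(256, deg_dim)
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # Pool over patches, then project: (B, N, in_dim) -> (B, deg_dim).
        return self.net(patch_feats.mean(dim=1))


class DisentangledEncoderSketch(nn.Module):
    """Frozen vision encoder plus a parallel DFM; emits one fused embedding."""

    def __init__(self, vision_encoder: nn.Module, dim: int = 768, deg_dim: int = 128):
        super().__init__()
        self.encoder = vision_encoder  # pre-trained and kept frozen
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.dfm = DegradedFeatureModule(dim, deg_dim)
        self.fuse = nn.Linear(dim + deg_dim, dim)  # fusion: concat + projection

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        patch_feats = self.encoder(image)    # assumed shape: (B, N, dim)
        content = patch_feats.mean(dim=1)    # pooled "content embedding"
        degradation = self.dfm(patch_feats)  # separate "degradation embedding"
        # Per the abstract, the fused features would then be processed by a
        # Large Language Model to yield the degradation-robust representation.
        return self.fuse(torch.cat([content, degradation], dim=-1))
```

Read against the table of contents, the two-stage strategy would map onto this sketch roughly as: Stage 1 trains the DFM while the encoder stays frozen, and Stage 2 fine-tunes the language model for retrieval on top of the fused embeddings.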
| Appears in Collections: | 電子工程學研究所 |
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-113-2.pdf (Restricted Access) | 10.71 MB | Adobe PDF |
Except where otherwise noted in their copyright terms, all items in this repository are protected by copyright, with all rights reserved.
