Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98824

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 陳銘憲 | zh_TW |
| dc.contributor.advisor | Ming-Syan Chen | en |
| dc.contributor.author | 林佳儀 | zh_TW |
| dc.contributor.author | Chia-Yi Lin | en |
| dc.date.accessioned | 2025-08-19T16:20:36Z | - |
| dc.date.available | 2025-08-20 | - |
| dc.date.copyright | 2025-08-19 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-08-12 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98824 | - |
| dc.description.abstract | 多模態檢索增強生成(RAG)系統在處理同時包含文本與視覺元素的文檔時,面臨維持跨模態一致性的重大挑戰。傳統方法受到模態隔離問題困擾,文本與圖像檢索各自獨立運作,經常導致回應結果結合了來自不同來源的無關文本與視覺內容。此問題在資源受限環境中更加嚴重,小型語言模型缺乏執行複雜跨模態增強任務所需的精密指令遵循能力。我們提出文本導向多模態 RAG,這是一種創新方法,利用文本檢索結果透過文本導向跨模態增強來改善圖像檢索效果。我們的方法透過首先執行文本檢索,接著使用 TF-IDF 分析萃取判別性關鍵詞以建立增強查詢進行圖像搜尋,解決模態隔離問題。我們引入新的重排序機制,優先選擇與檢索文本來自相同來源文檔的圖像,確保跨模態一致性。我們採用確定性關鍵詞萃取而非讓小型語言模型遵循複雜指令,避免了小型視覺語言模型無法處理複雜任務的限制。透過配對比較分析進行的全面實驗評估顯示在多項指標上獲得顯著改善。我們的方法在圖像檢索品質方面皆有提升(MRR:+20.1%,nDCG:+22.0%),同時維持適合本地部署情境的計算效率。比較分析顯示我們的文本導向方法優於小型語言模型輔助增強方法,展現出卓越的跨模態檢索效能。這些結果驗證了結構感知跨模態增強在資源受限多模態 RAG 部署上的有效性。 | zh_TW |
| dc.description.abstract | Multimodal Retrieval-Augmented Generation (RAG) systems face significant challenges in maintaining cross-modal coherence when processing documents containing both text and visual elements. Traditional approaches suffer from modal isolation, where text and image retrieval operate independently, often resulting in responses that combine unrelated textual and visual content from disparate sources. This problem is exacerbated in resource-constrained environments where small language models (SLMs) lack the sophisticated instruction-following capabilities required for complex cross-modal enhancement tasks. We propose Text-Informed Multimodal RAG, a novel approach that leverages text retrieval results to enhance image retrieval through text-informed cross-modal enhancement. Our method addresses the modal isolation problem by first performing text retrieval, then extracting discriminative keywords using TF-IDF analysis to create enhanced queries for image search. We introduce a new re-ranking mechanism that prioritizes images from the same source documents as the retrieved text, ensuring cross-modal coherence. This approach circumvents the limitations of small vision-language models by employing deterministic keyword extraction rather than complex instruction following. Comprehensive experimental evaluation through paired-comparison analysis demonstrates substantial improvements across multiple metrics. Our approach achieves significant gains in image retrieval quality (MRR: +20.1%, nDCG: +22.0%) while maintaining computational efficiency suitable for local deployment scenarios. Comparative analysis reveals that our text-informed method substantially outperforms SLM-aided enhancement approaches, demonstrating superior cross-modal retrieval effectiveness. These results validate the effectiveness of structure-aware cross-modal enhancement for resource-constrained multimodal RAG deployment. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-08-19T16:20:36Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-08-19T16:20:36Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Acknowledgements i
摘要 ii
Abstract iii
Contents v
List of Figures vii
List of Tables viii
1 Introduction 1
2 Related Work 5
2.1 Multimodal Retrieval-Augmented Generation 5
2.2 Vision-Language Models for Document Understanding 6
2.3 Query Enhancement and Expansion Techniques 9
2.4 Context-Aware Retrieval 10
3 Methods 12
3.1 System Architecture Overview 12
3.2 Component Design 15
3.2.1 Enhanced Query Module 15
3.2.2 Image Retrieve & Rerank Module 17
3.2.3 Result Generation Module 18
3.3 Algorithm Integration and Implementation 19
4 Experiments 21
4.1 Experimental Setup 21
4.1.1 Evaluation Framework 21
4.1.2 Implementation Details and Design Rationale 22
4.1.3 Evaluation Metrics 23
4.2 Results and Analysis 23
4.2.1 Text-Informed Enhancement Results 23
4.2.2 SLM-Aided Comparison 25
4.2.3 Ablation Study 26
4.3 Discussion 28
5 Conclusion 30
5.1 Summary of Our Work 30
5.2 Future Work 32
References 35 | - |
| dc.language.iso | en | - |
| dc.subject | 跨模態增強 | zh_TW |
| dc.subject | 文檔理解 | zh_TW |
| dc.subject | 本地部署 | zh_TW |
| dc.subject | 小型語言模型 | zh_TW |
| dc.subject | TF-IDF 分析 | zh_TW |
| dc.subject | 查詢增強 | zh_TW |
| dc.subject | 視覺語言模型 | zh_TW |
| dc.subject | 多模態檢索增強生成 | zh_TW |
| dc.subject | Vision-Language Models | en |
| dc.subject | Multimodal RAG | en |
| dc.subject | Cross-Modal Enhancement | en |
| dc.subject | Document Understanding | en |
| dc.subject | Local Deployment | en |
| dc.subject | Small Language Models | en |
| dc.subject | TF-IDF Analysis | en |
| dc.subject | Query Enhancement | en |
| dc.title | 多模態檢索增強生成:文本引導的跨模態增強技術用於高效本地部署 | zh_TW |
| dc.title | Multimodal RAG: Text-Informed Cross-Modal Enhancement for Efficient Local Deployment | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-2 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.oralexamcommittee | 楊得年;吳齊人;曹昱 | zh_TW |
| dc.contributor.oralexamcommittee | De-Nian Yang;Chi-Jen Wu;Yu Tsao | en |
| dc.subject.keyword | 多模態檢索增強生成,跨模態增強,視覺語言模型,文檔理解,本地部署,小型語言模型,TF-IDF 分析,查詢增強 | zh_TW |
| dc.subject.keyword | Multimodal RAG,Cross-Modal Enhancement,Vision-Language Models,Document Understanding,Local Deployment,Small Language Models,TF-IDF Analysis,Query Enhancement | en |
| dc.relation.page | 39 | - |
| dc.identifier.doi | 10.6342/NTU202504201 | - |
| dc.rights.note | 未授權 | - |
| dc.date.accepted | 2025-08-14 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 電機工程學系 | - |
| dc.date.embargo-lift | N/A | - |
| Appears in Collections: | 電機工程學系 |
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-113-2.pdf (restricted access) | 701.27 kB | Adobe PDF |
All items in the system are protected by copyright, with all rights reserved, unless otherwise indicated.
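The abstract above outlines the text-informed enhancement pipeline: retrieve text first, extract discriminative keywords with TF-IDF, build an enhanced query for image retrieval, and re-rank images so that those sharing a source document with the retrieved text come first. The sketch below is a minimal, self-contained illustration of that flow, not the thesis implementation: the toy corpus, the image captions, the use of TF-IDF cosine similarity as a stand-in for the actual text and image retrievers, and all parameters (top-k values, number of keywords) are assumptions made here for demonstration.

```python
# Minimal sketch (illustrative only) of text-informed cross-modal enhancement:
# 1) retrieve text for the query, 2) extract discriminative keywords via TF-IDF,
# 3) build an enhanced query for image search, 4) re-rank images so that those
# from the same source documents as the retrieved text are preferred.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data; a real system would index document chunks and page/figure images.
TEXT_CHUNKS = [
    {"doc": "report_A", "text": "Quarterly cloud revenue grew twelve percent year over year."},
    {"doc": "report_A", "text": "Figure 3 compares cloud revenue across the three regions."},
    {"doc": "report_B", "text": "The hiring plan focuses on expanding the sales team next year."},
]
IMAGES = [
    {"doc": "report_A", "caption": "Bar chart of cloud revenue by region"},
    {"doc": "report_B", "caption": "Organization chart of the sales team"},
]


def retrieve_text(query: str, k: int = 2) -> list[dict]:
    """Rank text chunks by TF-IDF cosine similarity (stand-in for a dense retriever)."""
    texts = [c["text"] for c in TEXT_CHUNKS]
    vec = TfidfVectorizer().fit(texts + [query])
    scores = cosine_similarity(vec.transform([query]), vec.transform(texts))[0]
    return [TEXT_CHUNKS[i] for i in scores.argsort()[::-1][:k]]


def discriminative_keywords(chunks: list[dict], n: int = 5) -> list[str]:
    """Pick the highest-weighted TF-IDF terms from the retrieved text chunks."""
    vec = TfidfVectorizer(stop_words="english")
    mat = vec.fit_transform([c["text"] for c in chunks])
    weights = mat.sum(axis=0).A1                # aggregate weight per vocabulary term
    terms = vec.get_feature_names_out()
    return [terms[i] for i in weights.argsort()[::-1][:n]]


def retrieve_images(query: str, chunks: list[dict], k: int = 2) -> list[tuple[dict, float]]:
    """Score captions against the keyword-enhanced query, then prefer same-source images."""
    enhanced_query = query + " " + " ".join(discriminative_keywords(chunks))
    captions = [im["caption"] for im in IMAGES]
    vec = TfidfVectorizer().fit(captions + [enhanced_query])
    scores = cosine_similarity(vec.transform([enhanced_query]), vec.transform(captions))[0]
    source_docs = {c["doc"] for c in chunks}
    ranked = sorted(
        zip(IMAGES, scores),
        key=lambda pair: (pair[0]["doc"] in source_docs, pair[1]),  # same-source first, then score
        reverse=True,
    )
    return [(im, float(s)) for im, s in ranked[:k]]


if __name__ == "__main__":
    query = "How did cloud revenue change by region?"
    text_hits = retrieve_text(query)
    for image, score in retrieve_images(query, text_hits):
        print(image["doc"], "|", image["caption"], "|", round(score, 3))
```

In the actual system the TF-IDF-scored retrieval calls would be replaced by the dense text and vision-language retrievers used in the thesis; the deterministic keyword extraction and the same-source re-ranking rule are the two steps this sketch is meant to illustrate.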
