Multimodal Retrieval-Augmented Generation with Knowledge Graph

蕭啟湘; Chi-Hsiang Hsiao

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97821

Title:	使用知識圖的多模態檢索增強生成 (Multimodal Retrieval-Augmented Generation with Knowledge Graph)
Author:	蕭啟湘 (Chi-Hsiang Hsiao)
Advisor:	陳祝嵩 (Chu-Song Chen)
Keywords:	Large Language Model, Multimodal, Retrieval-Augmented Generation, Knowledge Graph, Information Retrieval
Publication year:	2025
Degree:	Master's
Abstract:	Retrieval-augmented generation (RAG) lets large language models (LLMs) access external information dynamically and has proven effective for open-domain question answering. However, because of context-window limits, these models still struggle with high-level conceptual understanding and holistic comprehension, which constrains deep reasoning over lengthy, domain-specific content such as entire books. To address this limitation, knowledge graphs (KGs) have been used to build entity-centric graph structures and hierarchical summaries that give the reasoning process more structured support. Nevertheless, existing KG-based RAG solutions handle only textual input and make little use of other modalities, such as the images and tables in a document. Effective reasoning over documents that contain images requires integrating multiple cues (textual information, visual elements, and spatial layout) into hierarchically structured conceptual representations. To address these challenges, this study proposes MegaRAG, a multimodal knowledge graph-enhanced RAG approach that supports cross-modal reasoning for deeper content understanding. MegaRAG integrates visual cues into both the construction of the knowledge graph and the reasoning over it. The resulting multimodal knowledge graph yields more context-aware graph representations that better capture the semantics of multimodal inputs. Experiments on both global and fine-grained question-answering tasks show that MegaRAG consistently outperforms existing RAG methods on both textual and multimodal corpora.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97821
DOI:	10.6342/NTU202501901
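Full-text license:	Authorized (open to the public worldwide)
Full-text release date:	2025-07-19
Appears in collections:	Department of Computer Science and Information Engineering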

Files in this item:

File	Size	Format
ntu-113-2.pdf	7.46 MB	Adobe PDF


Unless their copyright terms are specifically indicated, all items in this system are protected by copyright, with all rights reserved.
