Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97821
Title: Multimodal Retrieval-Augmented Generation with Knowledge Graph
Author: Chi-Hsiang Hsiao (蕭啟湘)
Advisor: Chu-Song Chen (陳祝嵩)
Keywords: Large Language Model, Multimodal, Retrieval-Augmented Generation, Knowledge Graph, Information Retrieval
Publication Year: 2025
Degree: Master's
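Abstract: Retrieval-augmented generation combines large language models with external data retrieval mechanisms and has shown strong capabilities in open-ended problem-solving tasks. However, because of context length limits, these models often fail to fully grasp complex information when dealing with long content or with tasks that require holistic semantic reasoning, and their capacity for deep reasoning is especially limited on long domain-specific material such as entire books. To overcome this challenge, knowledge graphs, which are entity-centric graph structures supplemented with hierarchical summaries, have been widely applied to reasoning and comprehension tasks and effectively improve the structured representation of knowledge. Most existing knowledge-graph-based retrieval-augmented generation techniques, however, support only textual data and make little use of visual-modality information such as images and tables, which limits the integration and understanding of diverse information. Moreover, a more complete and accurate knowledge structure can be built only when multimodal cues such as visual features and spatial layout are incorporated as well. This study therefore proposes a cross-modal knowledge graph retrieval-augmented generation framework (MegaRAG) that integrates textual, visual, and other multimodal data into both knowledge graph construction and reasoning, giving the model richer semantic understanding and multimodal reasoning ability. This multimodal knowledge graph design strengthens the model's multifaceted perception of context and semantics and more effectively captures and represents semantic connections across modalities. Results from multiple experiments show that MegaRAG outperforms existing retrieval-augmented generation methods on both text-only and multimodal corpora, with stable and superior performance.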
Retrieval-augmented generation (RAG) enables large language models (LLMs) to dynamically access external information, demonstrating powerful capabilities for open-domain question-answering tasks. However, due to context window limitations, these models still struggle with high-level conceptual understanding and holistic comprehension, which constrains their ability to perform deep reasoning over lengthy, domain-specific content such as entire books. To address this limitation, knowledge graphs (KGs) have been employed to construct entity-centric graph structures and hierarchical summaries, providing more structured support for the reasoning process. Nevertheless, existing knowledge-graph-based RAG solutions handle only textual inputs and cannot effectively use the additional information provided by other modalities such as visual content. Furthermore, effective reasoning over documents that contain images requires integrating multiple cues, including textual information, visual elements, and spatial layout, into hierarchically structured conceptual representations. To address these challenges, this study proposes a multimodal knowledge-graph-enhanced retrieval-augmented generation approach (MegaRAG), which enables cross-modal reasoning to achieve deeper content understanding. MegaRAG integrates visual cues into both the construction and the reasoning processes of knowledge graphs. The resulting multimodal knowledge graph yields context-aware graph representations that better capture the semantic features of multimodal inputs. Experimental results on both global and fine-grained question-answering tasks demonstrate that MegaRAG consistently outperforms existing retrieval-augmented generation methods on both textual and multimodal corpora.
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97821
DOI: 10.6342/NTU202501901
Full-text Rights: Authorized (open access worldwide)
Embargo Lift Date: 2025-07-19
Appears in Collections: Department of Computer Science and Information Engineering

Files in This Item:
ntu-113-2.pdf (7.46 MB, Adobe PDF)


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
