NTU Theses and Dissertations Repository › College of Electrical Engineering and Computer Science › Department of Computer Science and Information Engineering
Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97920
Title: Driving Verifiable Generation through Multi-Document and Multi-Span Evidence Grounding
Authors: Jui-I Wang (王睿誼)
Advisor: Hsin-Hsi Chen (陳信希)
Keyword: information retrieval, question answering
Publication Year : 2025
Degree: Master's
Abstract: While Retrieval-Augmented Generation (RAG) has enhanced Large Language Models (LLMs), their "black box" nature poses a challenge to verifiability, particularly in complex scenarios requiring information synthesis across multiple documents. Existing benchmarks often fail to test the fine-grained, multi-span aggregation necessary for generating answers explicitly grounded in scattered evidence. To address this gap, this work introduces a framework for constructing challenging, evidence-grounded QA benchmarks, instantiated through our novel dataset, MultiDocSpanQA. Its rigorous construction pipeline involves creating complex multi-document scenarios via query fusion and document segmentation, resolving anaphora for evidence clarity, and employing a strict entailment-based validation method that ensures both the necessity of every evidence span and the faithfulness of the final answer. Our empirical evaluation using this benchmark reveals several insights: (1) retrieval performance imposes a significant bottleneck, limiting even powerful generators; (2) generative models face an inherent trade-off between comprehensive coverage and concise precision; and (3) prompting strategy has a profound impact, with an "Evidence-First" approach systematically enhancing answer verifiability and completeness. By offering a more challenging and nuanced benchmark, MultiDocSpanQA aims to drive the development of sophisticated QA systems, particularly RAG models, capable of robust cross-document comprehension, effective evidence aggregation, and the generation of trustworthy, verifiable answers.
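The entailment-based necessity check described in the abstract can be illustrated with a leave-one-out sketch: the answer must be entailed by the full evidence set, and a span counts as necessary if dropping it breaks that entailment. This is a minimal illustration, not the thesis pipeline; the `entails` function below is a toy lexical-coverage stand-in (an assumption for demonstration), where a real system would use a trained NLI entailment classifier, and the example spans and answer are invented.

```python
from typing import List

def entails(premise: str, hypothesis: str) -> bool:
    # Toy stand-in for an NLI model: treat entailment as lexical
    # coverage of the hypothesis' content words by the premise.
    # A real pipeline would call a trained entailment classifier.
    stop = {"the", "a", "an", "is", "was", "in", "of", "and"}
    hyp = {w.lower().strip(".,") for w in hypothesis.split()} - stop
    prem = {w.lower().strip(".,") for w in premise.split()}
    return hyp <= prem

def necessary_spans(spans: List[str], answer: str) -> List[bool]:
    """Leave-one-out check: span i is necessary if the remaining
    evidence no longer entails the answer once span i is removed."""
    if not entails(" ".join(spans), answer):
        raise ValueError("answer is not faithful to the full evidence")
    flags = []
    for i in range(len(spans)):
        rest = " ".join(spans[:i] + spans[i + 1:])
        flags.append(not entails(rest, answer))
    return flags

# Hypothetical example: two spans are load-bearing, one is not.
spans = [
    "Marie Curie won the Nobel Prize in Physics in 1903.",
    "She also won the Nobel Prize in Chemistry in 1911.",
    "Curie was born in Warsaw.",
]
answer = "Curie won the Nobel Prize in Physics and Chemistry."
print(necessary_spans(spans, answer))  # → [True, True, False]
```

The third span (birthplace) is flagged as unnecessary because the answer remains entailed without it, which is exactly the property the benchmark's validation step enforces before a span is kept as gold evidence.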
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97920
DOI: 10.6342/NTU202501932
Fulltext Rights: Not authorized
Embargo lift date: N/A
Appears in Collections: Department of Computer Science and Information Engineering

Files in This Item:
File: ntu-113-2.pdf (Restricted Access)
Size: 1.8 MB
Format: Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
