NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97920

Full metadata record
DC Field: Value [Language]
dc.contributor.advisor: 陳信希 [zh_TW]
dc.contributor.advisor: Hsin-Hsi Chen [en]
dc.contributor.author: 王睿誼 [zh_TW]
dc.contributor.author: Jui-I Wang [en]
dc.date.accessioned: 2025-07-23T16:06:29Z
dc.date.available: 2025-07-24
dc.date.copyright: 2025-07-23
dc.date.issued: 2025
dc.date.submitted: 2025-07-17
dc.identifier.citation:
Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. MS MARCO: A human generated machine reading comprehension dataset, 2018. URL https://arxiv.org/abs/1611.09268.
Valeriia Bolotova-Baranova, Vladislav Blinov, Sofya Filippova, Falk Scholer, and Mark Sanderson. WikiHowQA: A comprehensive benchmark for multi-document nonfactoid question answering. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5291–5314, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.290. URL https://aclanthology.org/2023.acl-long.290/.
Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024. URL https://arxiv.org/abs/2402.03216.
DeepSeek-AI. DeepSeek-V3 technical report, 2024. URL https://arxiv.org/abs/2412.19437.
Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs, 2019. URL https://arxiv.org/abs/1903.00161.
Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Guney, Volkan Cirik, and Kyunghyun Cho. SearchQA: A new Q&A dataset augmented with context from a search engine, 2017.
Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. ELI5: Long form question answering. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3558–3567, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1346. URL https://aclanthology.org/P19-1346/.
Google Gemini Team. Gemini: A family of highly capable multimodal models. ArXiv, abs/2312.11805, 2023. URL https://arxiv.org/abs/2312.11805.
Jonas Golde, Patrick Haller, Felix Hamborg, Julian Risch, and Alan Akbik. Fabricator: An open source toolkit for generating labeled training data with teacher LLMs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 1–11, Singapore, December 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.emnlp-demo.1.
Google Cloud / DeepMind. Gemini 2.5 Flash. https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash, June 2025. [Online; accessed 2025-07-15].
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783.
Sophie Henning, Talita Anthonio, Wei Zhou, Heike Adel, Mohsen Mesgar, and Annemarie Friedrich. Is the answer in the text? challenging ChatGPT with evidence retrieval from instructive text. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14229–14241, Singapore, December 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.findings-emnlp.949.
Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Donia Scott, Nuria Bel, and Chengqing Zong, editors, Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics. doi: 10.18653/v1/2020.coling-main.580. URL https://aclanthology.org/2020.coling-main.580/.
Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. Unnatural instructions: Tuning language models with (almost) no human labor. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14409–14428, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.806. URL https://aclanthology.org/2023.acl-long.806/.
Siqing Huo, Negar Arabzadeh, and Charles L. A. Clarke. Retrieving supporting evidence for LLMs generated answers, 2023.
Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Regina Barzilay and Min-Yen Kan, editors, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1147. URL https://aclanthology.org/P17-1147.
Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328, 2018. doi: 10.1162/tacl_a_00023. URL https://aclanthology.org/Q18-1023/.
Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. NV-Embed: Improved techniques for training LLMs as generalist embedding models, 2025. URL https://arxiv.org/abs/2405.17428.
Seongyun Lee, Hyunjae Kim, and Jaewoo Kang. LIQUID: A framework for list question answering dataset generation, 2023. URL https://arxiv.org/abs/2302.01691.
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 9459–9474. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf.
Haonan Li, Martin Tomko, Maria Vasardani, and Timothy Baldwin. MultiSpanQA: A dataset for multi-span question answering. In Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz, editors, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1250–1260, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.90. URL https://aclanthology.org/2022.naacl-main.90.
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment, 2023. URL https://arxiv.org/abs/2303.16634.
Meta. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation, April 2025. URL https://ai.meta.com/blog/llama-4-multimodal-intelligence/.
OpenAI. GPT-4 technical report. ArXiv, 2023. URL https://api.semanticscholar.org/CorpusID:257532815.
OpenAI. Introducing OpenAI o3 and o4-mini, April 2025. URL https://openai.com/index/introducing-o3-and-o4-mini/.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Jian Su, Kevin Duh, and Xavier Carreras, editors, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas, November 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1264. URL https://aclanthology.org/D16-1264/.
Ivan Stelmakh, Yi Luan, Bhuwan Dhingra, and Ming-Wei Chang. ASQA: Factoid questions meet long-form answers, 2023. URL https://arxiv.org/abs/2204.06092.
Yixuan Tang and Yi Yang. Multihop-RAG: Benchmarking retrieval-augmented generation for multi-hop queries. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=t4eB3zYWBK.
Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, et al. Gemma 3 technical report, 2025. URL https://arxiv.org/abs/2503.19786.
James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: a large-scale dataset for fact extraction and VERification. In Marilyn Walker, Heng Ji, and Amanda Stent, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1074. URL https://aclanthology.org/N18-1074/.
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. ♫ MuSiQue: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022. doi: 10.1162/tacl_a_00475. URL https://aclanthology.org/2022.tacl-1.31/.
Yuwei Wan, Yixuan Liu, Aswathy Ajith, Clara Grazian, Bram Hoex, Wenjie Zhang, Chunyu Kit, Tong Xie, and Ian Foster. SciQAG: A framework for auto-generated science question answering dataset with fine-grained evaluation, 2024. URL https://arxiv.org/abs/2405.09939.
Jui-I Wang, Hen-Hsen Huang, and Hsin-Hsi Chen. MESAQA: A dataset for multi-span contextual and evidence-grounded question answering. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert, editors, Proceedings of the 31st International Conference on Computational Linguistics, pages 10891–10901, Abu Dhabi, UAE, January 2025. Association for Computational Linguistics. URL https://aclanthology.org/2025.coling-main.724/.
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions, 2023. URL https://arxiv.org/abs/2212.10560.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023.
Adina Williams, Nikita Nangia, and Samuel R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference, 2018.
xAI. Grok 3 Beta—The Age of Reasoning Agents, February 2025. URL https://x.ai/news/grok-3.
Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, and Jian-Yun Nie. C-Pack: Packed resources for general Chinese embeddings, 2024. URL https://arxiv.org/abs/2309.07597.
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zhihao Fan. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024.
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018.
Zhiang Yue, Jingping Liu, Cong Zhang, Chao Wang, Haiyun Jiang, Yue Zhang, Xianyang Tian, Zhedong Cen, Yanghua Xiao, and Tong Ruan. MA-MRC: A multi-answer machine reading comprehension dataset. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’23, pages 2144–2148, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9781450394086. doi: 10.1145/3539618.3592015. URL https://doi.org/10.1145/3539618.3592015.
Andrew Zhu, Alyssa Hwang, Liam Dugan, and Chris Callison-Burch. FanOutQA: A multi-hop, multi-document question answering benchmark for large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 18–37, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-short.2. URL https://aclanthology.org/2024.acl-short.2/.
Ming Zhu, Aman Ahuja, Da-Cheng Juan, Wei Wei, and Chandan K. Reddy. Question answering with long multiple-span answers. In Trevor Cohn, Yulan He, and Yang Liu, editors, Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3840–3849, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.342. URL https://aclanthology.org/2020.findings-emnlp.342.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97920
dc.description.abstract: 儘管檢索增強生成(RAG)技術已提升了大型語言模型(LLMs)的能力,但其「黑箱」的本質對可驗證性構成了挑戰,尤其是在需要整合多份文件資訊的複雜情境中。現有評測資料集往往難以有效檢驗模型在整合多個分散證據片段以生成明確答案時所需的細緻整合能力。
為彌補此一差距,本研究提出一個框架,用以建構具挑戰性且以證據為基礎的問答(QA)評測基準,並透過我們新創的資料集 MultiDocSpanQA 將此框架具體實現。其嚴謹的建構流程包含:透過問題融合與文件分割來創建複雜的多文件情境、為確保證據清晰而進行指代消解,並採用一套嚴格的、基於蘊含關係的驗證方法,以確保每個證據片段的必要性,以及最終答案的忠實性。我們運用此評測基準的實證評估,揭示了幾項洞見:(1) 檢索效能是顯著的效能瓶頸;(2) 生成模型在「覆蓋範圍」與「精確性」之間,存在著固有的權衡取捨;(3) 提示策略具有深遠的影響,其中「先提供證據」的方法能系統性地提升答案的可驗證性與完整性。透過提供一個更具挑戰性、更細緻的評測基準,MultiDocSpanQA 旨在推動更先進的問答系統(特別是 RAG 模型)的發展,使其能具備強健的跨文件理解力、有效的證據聚合力,並能生成值得信賴且可驗證的答案。 [zh_TW]
dc.description.abstract: While Retrieval-Augmented Generation (RAG) has enhanced Large Language Models (LLMs), their "black box" nature poses a challenge to verifiability, particularly in complex scenarios requiring information synthesis across multiple documents. Existing benchmarks often fail to test the fine-grained, multi-span aggregation necessary for generating answers explicitly grounded in scattered evidence. To address this gap, this work introduces a framework for constructing challenging, evidence-grounded QA benchmarks, instantiated through our novel dataset, MultiDocSpanQA. Its rigorous construction pipeline creates complex multi-document scenarios via query fusion and document segmentation, resolves anaphora for evidence clarity, and employs a strict entailment-based validation method that ensures both the necessity of every evidence span and the faithfulness of the final answer. Our empirical evaluation using this benchmark reveals several insights: (1) retrieval performance imposes a significant bottleneck, limiting even powerful generators; (2) generative models face an inherent trade-off between comprehensive coverage and concise precision; and (3) prompting strategy has a profound impact, with an "Evidence-First" approach systematically enhancing answer verifiability and completeness. By offering a more challenging and nuanced benchmark, MultiDocSpanQA aims to drive the development of sophisticated QA systems, particularly RAG models, capable of robust cross-document comprehension, effective evidence aggregation, and the generation of trustworthy, verifiable answers. [en]
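The entailment-based validation described in the abstract admits a simple leave-one-out formulation: the answer must be entailed by the full set of evidence spans, and dropping any single span must break that entailment (otherwise the span is redundant). The sketch below is a minimal illustration of that idea, not the thesis's actual implementation; the NLI model (roberta-large-mnli), the decision threshold, and all function names are assumptions chosen for exposition.

# Minimal leave-one-out entailment check for evidence-span necessity.
# Illustrative only: model, threshold, and span-joining strategy are
# assumptions, not taken from the thesis.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def entails(premise: str, hypothesis: str, threshold: float = 0.5) -> bool:
    # The pipeline accepts a {"text", "text_pair"} dict for sentence-pair tasks;
    # top_k=None returns scores for all labels.
    scores = nli({"text": premise, "text_pair": hypothesis}, top_k=None)
    by_label = {s["label"]: s["score"] for s in scores}
    return by_label.get("ENTAILMENT", 0.0) >= threshold

def validate_instance(evidence_spans: list[str], answer: str) -> bool:
    full = " ".join(evidence_spans)
    if not entails(full, answer):              # faithfulness: the answer must
        return False                           # follow from the full evidence set
    for i in range(len(evidence_spans)):
        rest = " ".join(evidence_spans[:i] + evidence_spans[i + 1:])
        if entails(rest, answer):              # the answer still follows without
            return False                       # span i, so span i is not necessary
    return True

Per the abstract, such checks are applied after anaphora resolution, so each span reads as a self-contained premise when judged by the entailment model.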
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-07-23T16:06:29Z. No. of bitstreams: 0 [en]
dc.description.provenance: Made available in DSpace on 2025-07-23T16:06:29Z (GMT). No. of bitstreams: 0 [en]
dc.description.tableofcontents:
Acknowledgements i
摘要 ii
Abstract iii
Contents v
List of Figures viii
List of Tables x
Chapter 1 Introduction 1
1.1 Motivation 1
1.2 Thesis Organization 4
Chapter 2 Related Work 6
2.1 Evolution of QA from Single Documents 6
2.2 Integrating Information across Multiple Documents 7
2.3 Evidence Retrieval 8
2.4 Automated Dataset Generation 8
Chapter 3 Dataset Construction 10
3.1 Pipeline Overview 10
3.2 Multi-Doc Instance Construction 11
3.3 Anaphora Resolution 14
3.4 Answer Generation and Entailment Verification 15
3.4.1 Preliminary Quality Control: QA Relevance Check 16
3.4.2 Answer Generation and Entailment Verification 16
3.5 Statistics 19
Chapter 4 Experiment 22
4.1 Task Design 22
4.1.1 Task 1: Document-level Retrieval 22
4.1.2 Task 2: Retrieval-Augmented Generation (RAG) 23
4.1.3 Task 3: Machine Reading Comprehension (MRC) 23
4.1.4 Task 4: Evidence-Grounded Generation 24
4.2 Experimental Setup 24
4.2.1 Models 24
4.2.2 Evaluation Metrics 25
4.3 Results and Discussion 26
4.3.1 Document-level Retrieval 26
4.3.2 End-to-End Experiment with Retrieval & Generation 28
4.3.3 Machine Reading Comprehension 32
4.3.4 Evidence Grounding Generation 34
Chapter 5 Conclusion 37
Chapter 6 Limitation 38
Chapter 7 Future Work 39
References 40
Appendix A — Prompts Used 49
A.1 Prompt templates 49
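Chapter 4 opens with a document-level retrieval task, which the abstract identifies as the main performance bottleneck for end-to-end RAG. As a rough sketch of how such a task is commonly scored, the snippet below computes recall@k with a dense embedder; the model choice (BAAI/bge-m3, which appears in the reference list), the cutoff k, and the helper's shape are illustrative assumptions, not the thesis's exact protocol.

# Sketch of document-level retrieval scoring via recall@k.
# Model choice and k are assumptions for illustration.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")  # multilingual embedder cited in the references

def recall_at_k(query: str, docs: list[str], gold_ids: set[int], k: int = 5) -> float:
    # Embed query and documents; with normalized vectors, dot product = cosine similarity.
    q = model.encode([query], normalize_embeddings=True)
    d = model.encode(docs, normalize_embeddings=True)
    sims = (q @ d.T)[0]
    top_k = set(np.argsort(-sims)[:k].tolist())
    # Fraction of gold (evidence-bearing) documents retrieved in the top k.
    return len(top_k & gold_ids) / max(1, len(gold_ids))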
dc.language.iso: en
dc.subject: 問答系統 [zh_TW]
dc.subject: 資訊檢索 [zh_TW]
dc.subject: question answering [en]
dc.subject: information retrieval [en]
dc.title: 透過多文件與多片段的證據溯源以驅動可驗證的內容生成 [zh_TW]
dc.title: Driving Verifiable Generation through Multi-Document and Multi-Span Evidence Grounding [en]
dc.type: Thesis
dc.date.schoolyear: 113-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 陳冠宇;黃乾綱;黃瀚萱 [zh_TW]
dc.contributor.oralexamcommittee: Kuan-Yu Chen;Chien-Kang Huang;Hen-Hsen Huang [en]
dc.subject.keyword: 資訊檢索, 問答系統 [zh_TW]
dc.subject.keyword: information retrieval, question answering [en]
dc.relation.page: 54
dc.identifier.doi: 10.6342/NTU202501932
dc.rights.note: 未授權 (not authorized for public access)
dc.date.accepted: 2025-07-18
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 資訊工程學系 (Department of Computer Science and Information Engineering)
dc.date.embargo-lift: N/A
Appears in Collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in this item:
File: ntu-113-2.pdf (restricted; not authorized for public access)
Size: 1.8 MB
Format: Adobe PDF


All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
