NTU Theses and Dissertations Repository
Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/100997
Title: GAIA-Ext: Benchmarking AI Agent Overfitting with a Problem-Level Aligned Dataset
Authors: Yun-Ang Wu (吳耘安)
Advisor: Shou-De Lin (林守德)
Keyword: AI Agents, Large Language Models, Datasets, Problem-Level Alignment, Question Generation, In-Context Learning, Prompt Engineering
Publication Year: 2025
Degree: Master's
Abstract: Recent advances in large language models have equipped AI agents with sophisticated reasoning, planning, and tool-use abilities. In real-world deployments, however, agents often require adaptation to specific domains or user contexts. While such adaptation improves an agent's expertise in the target setting, it also increases the risk of overfitting to training samples and weakens generalization to related but unseen problems. Existing benchmarks cannot adequately evaluate problem-level generalization because they lack systematically constructed, structurally aligned test variants for each validation query.
To address this gap, we develop a question generation framework that produces structurally similar variants of agent-based questions, implement it as an LLM-based pipeline that generates such questions automatically, and validate its effectiveness through rigorous human evaluation. Building on this framework, we introduce GAIA-Ext, a new benchmark that extends the well-known GAIA dataset by associating each original task with multiple problem-level aligned variants. This design enables principled assessment of agent generalization beyond observed examples.
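The abstract does not spell out the generation pipeline. Purely as an illustration, the following is a minimal sketch of what an LLM-based aligned-variant generator along these lines could look like, assuming the OpenAI chat-completions client; the prompt wording, model name, and JSON output convention are assumptions, not the thesis's actual method, and the human-evaluation filtering the abstract describes is omitted:

```python
# Illustrative sketch only: produce problem-level aligned variants of a task
# with an LLM. Prompt, model, and output format are assumptions for
# illustration, not the method described in the thesis.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

VARIANT_PROMPT = """You are given a benchmark task for an AI agent.
Write {n} new tasks that keep the same reasoning structure, required tools,
and difficulty, but change the surface details (entities, values, sources).
Return only a JSON list of strings.

Task: {task}"""

def generate_aligned_variants(task: str, n: int = 3, model: str = "gpt-4o") -> list[str]:
    """Ask the LLM for n structurally aligned variants of `task`."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": VARIANT_PROMPT.format(n=n, task=task)}],
    )
    # Assumes the model returns bare JSON; a production pipeline would
    # validate and filter candidates (e.g., via human evaluation).
    return json.loads(response.choices[0].message.content)
```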
Our empirical studies demonstrate that GAIA-Ext supports robust diagnosis of overfitting and exposes limitations in current agent adaptation strategies. To encourage further research on robust and adaptable AI agents, we will publicly release the GAIA-Ext dataset and our experimental source code.
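For concreteness, one simple way a benchmark with per-task aligned variants can surface overfitting is to compare an agent's accuracy on the original tasks against its mean accuracy on the variants; a large positive gap suggests memorization of the originals. The sketch below illustrates this idea; the record layout and the gap definition are illustrative assumptions rather than GAIA-Ext's published metric:

```python
# Illustrative sketch: a generalization-gap statistic for a benchmark where
# each original task has several problem-level aligned variants. The record
# layout and metric definition are assumptions for illustration.
from statistics import mean

def generalization_gap(results: list[dict]) -> float:
    """results: one dict per original task, e.g.
    {"original_correct": True, "variant_correct": [True, False, False]}
    Returns accuracy(originals) minus mean accuracy(variants)."""
    orig_acc = mean(1.0 if r["original_correct"] else 0.0 for r in results)
    var_acc = mean(
        mean(1.0 if ok else 0.0 for ok in r["variant_correct"]) for r in results
    )
    return orig_acc - var_acc

# An agent that memorized the originals shows a large positive gap:
demo = [
    {"original_correct": True, "variant_correct": [True, False, False]},
    {"original_correct": True, "variant_correct": [False, False, True]},
]
print(f"generalization gap: {generalization_gap(demo):.2f}")  # prints 0.67
```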
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/100997
DOI: 10.6342/NTU202504501
Fulltext Rights: Authorized (access limited to campus network)
Embargo Lift Date: 2025-11-27
Appears in Collections: Graduate Institute of Networking and Multimedia

Files in This Item:
File: ntu-114-1.pdf | Size: 1.63 MB | Format: Adobe PDF | Access limited to the NTU IP range


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
