Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/100997
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | 林守德 | zh_TW
dc.contributor.advisor | Shou-De Lin | en
dc.contributor.author | 吳耘安 | zh_TW
dc.contributor.author | Yun-Ang Wu | en
dc.date.accessioned | 2025-11-26T16:24:21Z | -
dc.date.available | 2025-11-27 | -
dc.date.copyright | 2025-11-26 | -
dc.date.issued | 2025 | -
dc.date.submitted | 2025-10-01 | -
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/100997 | -
dc.description.abstract | In recent years, rapid advances in large language models have given AI agents more sophisticated reasoning, planning, and tool-use capabilities. In real-world applications, however, AI agents often need to be adapted to a specific domain or user context. While such adaptation improves an agent's expertise in that scenario, it also increases the risk of overfitting to the training samples and weakens generalization to structurally similar questions. Because existing benchmark datasets usually lack aligned variant questions for testing, they cannot adequately evaluate an agent's problem-level generalization.
To address this gap, we develop a question generation framework for producing such variant questions, build an LLM-based system that generates them automatically, and validate its effectiveness through rigorous human evaluation. In addition, we construct GAIA-Ext, a new benchmark dataset created by extending the GAIA dataset, which allows researchers to evaluate the generalization ability of AI agents more rigorously.
Our experiments show that GAIA-Ext effectively exposes the limitations of current AI agents and reveals their overfitting behavior. To promote further research on AI agents, we will publicly release the GAIA-Ext dataset and our experimental code for academic use.
zh_TW
dc.description.abstract | Recent advances in large language models have empowered AI agents with sophisticated reasoning, planning, and tool-use abilities. However, in real-world deployments, AI agents often require adaptation to specific domains or user contexts. This adaptation increases the risk of overfitting to training samples and reduces their ability to generalize to related but unseen problems. Existing benchmarks cannot adequately evaluate problem-level generalization because they lack systematically constructed, structurally aligned test variants for each validation query.
To address this gap, we develop a question generation framework that produces structurally similar variants of agent-oriented questions, build an LLM-based method to generate such questions automatically, and validate its effectiveness through rigorous human evaluation. In addition, we introduce GAIA-Ext, a new benchmark that extends the well-known GAIA dataset by associating each original task with multiple problem-level aligned variants. This design enables principled assessment of agent generalization beyond observed examples.
Our empirical studies demonstrate that GAIA-Ext supports robust diagnosis of overfitting and exposes limitations in current agent adaptation strategies. To encourage further research on robust and adaptable AI agents, we plan to publicly release the GAIA-Ext dataset and our experimental source code.
en
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-11-26T16:24:20Z. No. of bitstreams: 0 | en
dc.description.provenance | Made available in DSpace on 2025-11-26T16:24:21Z (GMT). No. of bitstreams: 0 | en
dc.description.tableofcontents | 誌謝 (Acknowledgements) i
摘要 (Abstract in Chinese) ii
Abstract iii
Contents v
List of Figures vii
List of Tables x
Chapter 1 Introduction 1
Chapter 2 Related Work 5
Chapter 3 Preliminaries 8
3.1 AI Agent 8
3.1.1 Overview 8
3.1.2 AI Agent Overfitting 9
3.2 GAIA Benchmark 9
Chapter 4 Methodology 11
4.1 Problem-Level Alignment Principles 11
4.2 Automatic Framework Design 12
4.2.1 Framework Architecture 13
4.2.2 Prompt 13
4.3 Framework Validation 14
Chapter 5 Dataset Construction 17
5.1 Overview of GAIA-Ext 18
5.2 Annotation Principles for GAIA-Ext 19
5.3 Dataset Curation and Quality Control 20
Chapter 6 Experiments 22
6.1 Experiment Setup 22
6.2 Experimental Results Using GAIA-Ext 24
6.3 Experimental Results Using Automatically Generated Questions 27
Chapter 7 Conclusion 29
References 30
Appendix A — Prompts Used for Question Generation 36
Appendix B — Prompts Used for Experiments 42
Appendix C — Examples of Automatically Generated Questions 45
Appendix D — Annotation Considerations 47
-
dc.language.iso | en | -
dc.subject | AI代理 | -
dc.subject | 大型語言模型 | -
dc.subject | 資料集 | -
dc.subject | 問題層級對齊 | -
dc.subject | 問題生成 | -
dc.subject | 上下文學習 | -
dc.subject | 提示工程 | -
dc.subject | AI Agents | -
dc.subject | Large Language Models | -
dc.subject | Datasets | -
dc.subject | Problem-Level Alignment | -
dc.subject | Question Generation | -
dc.subject | In-Context Learning | -
dc.subject | Prompt Engineering | -
dc.title | GAIA-Ext:以問題層級對齊資料集對 AI 代理過擬合進行基準測試 | zh_TW
dc.title | GAIA-Ext: Benchmarking AI Agent Overfitting with a Problem-Level Aligned Dataset | en
dc.type | Thesis | -
dc.date.schoolyear | 114-1 | -
dc.description.degree | 碩士 (Master's) | -
dc.contributor.oralexamcommittee | 陳縕儂;孫紹華 | zh_TW
dc.contributor.oralexamcommittee | Yun-Nung Chen;Shao-Hua Sun | en
dc.subject.keyword | AI代理, 大型語言模型, 資料集, 問題層級對齊, 問題生成, 上下文學習, 提示工程 | zh_TW
dc.subject.keyword | AI Agents, Large Language Models, Datasets, Problem-Level Alignment, Question Generation, In-Context Learning, Prompt Engineering | en
dc.relation.page | 48 | -
dc.identifier.doi | 10.6342/NTU202504501 | -
dc.rights.note | 同意授權(限校園內公開) (Authorized for access, restricted to campus) | -
dc.date.accepted | 2025-10-01 | -
dc.contributor.author-college | 電機資訊學院 (College of Electrical Engineering and Computer Science) | -
dc.contributor.author-dept | 資訊網路與多媒體研究所 (Graduate Institute of Networking and Multimedia) | -
dc.date.embargo-lift | 2025-11-27 | -
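
The abstract above describes evaluating problem-level generalization by pairing each original GAIA task with several structurally aligned variants and comparing an agent's performance on the two sets. The following is a minimal Python sketch of that evaluation idea only; the Task structure, the exact-match scoring rule, and every name in it (Task, accuracy, generalization_gap, agent) are illustrative assumptions, not the thesis's released code or the GAIA-Ext schema.

    from dataclasses import dataclass, field
    from typing import Callable, List

    # Hypothetical task record: an original benchmark question paired with
    # its problem-level aligned variants (assumed structure for illustration).
    @dataclass
    class Task:
        question: str
        answer: str
        variants: List["Task"] = field(default_factory=list)

    def accuracy(tasks: List[Task], agent: Callable[[str], str]) -> float:
        # Exact-match accuracy of an agent's answers over a list of tasks.
        if not tasks:
            return 0.0
        correct = sum(
            agent(t.question).strip().lower() == t.answer.strip().lower()
            for t in tasks
        )
        return correct / len(tasks)

    def generalization_gap(originals: List[Task], agent: Callable[[str], str]) -> float:
        # Accuracy on the original tasks minus accuracy on their aligned variants.
        # After adapting an agent to the originals, a large positive gap is one
        # symptom of problem-level overfitting.
        variants = [v for t in originals for v in t.variants]
        return accuracy(originals, agent) - accuracy(variants, agent)

Under a protocol of this kind, an agent that has merely memorized the original questions scores well on the first term but poorly on the second, which is the overfitting signature the abstract refers to.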
Appears in collections: 資訊網路與多媒體研究所 (Graduate Institute of Networking and Multimedia)

Files in this item:
File | Size | Format
ntu-114-1.pdf (access restricted to NTU campus IPs; off-campus users, please connect via the NTU VPN service) | 1.63 MB | Adobe PDF


Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated in their copyright terms.
