Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97913

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 陳信希 | zh_TW |
| dc.contributor.advisor | Hsin-Hsi Chen | en |
| dc.contributor.author | 林緯翔 | zh_TW |
| dc.contributor.author | Wei-Hsiang Lin | en |
| dc.date.accessioned | 2025-07-23T16:04:53Z | - |
| dc.date.available | 2025-07-24 | - |
| dc.date.copyright | 2025-07-23 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-07-16 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97913 | - |
| dc.description.abstract | 「大型語言模型作為裁判」的框架在人工智慧評估中日益受到重視,然而關於模型的生成能力與判斷能力之間關係的研究結果卻仍不一致。我們透過系統性的資料集層級與樣本層級分析,針對 11 個模型與 21 種多樣任務,深入探討這一關係。儘管這兩種能力皆依賴於相同的基礎知識,我們的分析顯示它們之間僅存在微弱的相關性,主要原因在於大型語言模型對被評估答案的敏感性。為了解決此問題,我們提出一種自我參照引導的評估策略,利用模型自身的回答作為參考標準。此方法顯著增強了生成能力與判斷能力之間的關聯性,提供了一個實用的途徑來對齊這兩種技能,並因此為評估任務中的模型選擇提供了一個可靠的替代指標。 | zh_TW |
| dc.description.abstract | LLM-as-Judge frameworks are increasingly popular for AI evaluation, yet research findings on the relationship between models' generation and judgment abilities remain inconsistent. We investigate this relationship through systematic dataset- and instance-level analyses across 11 models and 21 diverse tasks. Despite both capabilities relying on the same underlying knowledge, our analyses reveal they are only weakly correlated, primarily due to LLMs' sensitivity to the responses being judged. To address this, we propose a self-reference-guided evaluation strategy that leverages a model's own answers as references. This approach significantly strengthens the correlation between generation and judgment abilities, offering a practical path to align these skills and, as a result, providing a reliable proxy for model selection in evaluation tasks. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-07-23T16:04:53Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-07-23T16:04:53Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Acknowledgements; 摘要; Abstract; Contents; List of Figures; List of Tables; Chapter 1 Introduction (1.1 Motivation; 1.2 Thesis Organization); Chapter 2 Related Work (2.1 LLM-as-Judge Framework; 2.2 Reliability and Limitations of LLM Judges; 2.3 Generation-Evaluation Capability Relationship); Chapter 3 Methodology (3.1 Objective & Notations; 3.2 Evaluation Models; 3.3 Evaluation Tasks; 3.4 Prompt Design; 3.4.1 Answer Generation Prompts; 3.4.2 Answer Judgment Prompts; 3.5 Rationale for Pointwise Evaluation); Chapter 4 Relationship Between Answer Generation and Answer Judgment (4.1 Dataset-Level Observations; 4.2 Dataset-Level Finer-Grained Observations; 4.3 Instance-Level Observations); Chapter 5 Self-Reference-Guided Evaluation (5.1 Experimental Setup; 5.2 Results and Observations); Chapter 6 Discussion; Chapter 7 Conclusion, Limitation and Future Work (7.1 Conclusion; 7.2 Limitation and Future Work; 7.2.1 Limitation; 7.2.2 Future Work); References | - |
| dc.language.iso | en | - |
| dc.subject | 大型語言模型作為裁判 | zh_TW |
| dc.subject | 大型語言模型 | zh_TW |
| dc.subject | LLM-as-Judge | en |
| dc.subject | Large Language Model | en |
| dc.title | 基於自我參照引導校準方法提升大型語言模型生成與判斷能力關聯性之研究 | zh_TW |
| dc.title | Self-Reference-Guided Calibration for Enhancing the Correlation between Generation and Judgment Capabilities in Large Language Models | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-2 | - |
| dc.description.degree | Master's | - |
| dc.contributor.oralexamcommittee | 陳冠宇;黃乾綱;黃瀚萱 | zh_TW |
| dc.contributor.oralexamcommittee | Kuan-Yu Chen;Chien-Kang Huang;Hen-Hsen Huang | en |
| dc.subject.keyword | 大型語言模型, 大型語言模型作為裁判 | zh_TW |
| dc.subject.keyword | Large Language Model, LLM-as-Judge | en |
| dc.relation.page | 53 | - |
| dc.identifier.doi | 10.6342/NTU202501902 | - |
| dc.rights.note | Authorization granted (access restricted to campus) | - |
| dc.date.accepted | 2025-07-17 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 資訊工程學系 | - |
| dc.date.embargo-lift | 2030-07-15 | - |
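
The abstract above describes a self-reference-guided evaluation strategy in which the judge model's own answer to a question is supplied as the reference when it judges a candidate response. The snippet below is a minimal illustrative sketch of that idea only, not the thesis's actual implementation: `query_llm` is a hypothetical stand-in for any chat-completion client, and the prompt wording is assumed rather than taken from the thesis.

```python
# Minimal sketch of a self-reference-guided judgment step (illustrative only).
# `query_llm(model, prompt)` is a hypothetical helper standing in for a real
# chat-completion API client; prompts are assumptions, not the thesis's prompts.

def query_llm(model: str, prompt: str) -> str:
    """Hypothetical wrapper around an LLM API; replace with an actual client call."""
    raise NotImplementedError("plug in a real LLM client here")


def self_reference_judge(judge_model: str, question: str, candidate_answer: str) -> str:
    # Step 1: the judge model first generates its own answer to the question.
    self_answer = query_llm(
        judge_model,
        f"Answer the following question concisely.\n\nQuestion: {question}",
    )

    # Step 2: that self-generated answer is passed back in as the reference
    # when the same model judges the candidate response, anchoring the verdict
    # to what the model itself believes the correct answer to be.
    verdict = query_llm(
        judge_model,
        "You are grading a candidate answer.\n"
        f"Question: {question}\n"
        f"Reference answer (the judge's own earlier answer): {self_answer}\n"
        f"Candidate answer: {candidate_answer}\n"
        "Reply with 'correct' or 'incorrect' and a one-sentence justification.",
    )
    return verdict
```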
Appears in Collections: 資訊工程學系
Files in this item:
| File | Size | Format |
|---|---|---|
| ntu-113-2.pdf (not authorized for public access) | 986.16 kB | Adobe PDF |
Except where otherwise noted in their copyright terms, all items in this repository are protected by copyright, with all rights reserved.
