Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97563

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 許永真 | zh_TW |
| dc.contributor.advisor | Jane Yung-jen Hsu | en |
| dc.contributor.author | 林鴻儒 | zh_TW |
| dc.contributor.author | Hung-Ju Lin | en |
| dc.date.accessioned | 2025-07-02T16:28:53Z | - |
| dc.date.available | 2025-07-03 | - |
| dc.date.copyright | 2025-07-02 | - |
| dc.date.issued | 2024 | - |
| dc.date.submitted | 2025-06-19 | - |
| dc.identifier.citation | [1] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[2] A. R. Fabbri, W. Kryściński, B. McCann, C. Xiong, R. Socher, and D. Radev. SummEval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9:391–409, 2021.
[3] J. Fu, S.-K. Ng, Z. Jiang, and P. Liu. GPTScore: Evaluate as you desire. arXiv preprint arXiv:2302.04166, 2023.
[4] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[5] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
[6] P. Ke, B. Wen, Z. Feng, X. Liu, X. Lei, J. Cheng, S. Wang, A. Zeng, Y. Dong, H. Wang, et al. CritiqueLLM: Scaling LLM-as-Critic for effective and explainable evaluation of Large Language Model generation. arXiv preprint arXiv:2311.18702, 2023.
[7] S. Kim, J. Shin, Y. Cho, J. Jang, S. Longpre, H. Lee, S. Yun, S. Shin, S. Kim, J. Thorne, et al. Prometheus: Inducing fine-grained evaluation capability in language models. In The Twelfth International Conference on Learning Representations, 2023.
[8] S. Kim, J. Suk, S. Longpre, B. Y. Lin, J. Shin, S. Welleck, G. Neubig, M. Lee, K. Lee, and M. Seo. Prometheus 2: An open source language model specialized in evaluating other language models. arXiv preprint arXiv:2405.01535, 2024.
[9] J. Li, S. Sun, W. Yuan, R.-Z. Fan, H. Zhao, and P. Liu. Generative judge for evaluating alignment. arXiv preprint arXiv:2310.05470, 2023.
[10] R. Likert. A technique for the measurement of attitudes. Archives of Psychology, 1932.
[11] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, 2004.
[12] M. Liu, Y. Shen, Z. Xu, Y. Cao, E. Cho, V. Kumar, R. Ghanadan, and L. Huang. X-Eval: Generalizable multi-aspect text evaluation via augmented instruction tuning with auxiliary evaluation aspects. arXiv preprint arXiv:2311.08788, 2023.
[13] Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, and C. Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment. arXiv preprint arXiv:2303.16634, 2023.
[14] Y. Liu, N. S. Moosavi, and C. Lin. LLMs as narcissistic evaluators: When ego inflates evaluation scores. arXiv preprint arXiv:2311.09766, 2023.
[15] B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
[16] OpenAI. ChatGPT, 2022. Accessed: 2024-11-07.
[17] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
[18] A. Panickssery, S. R. Bowman, and S. Feng. LLM evaluators recognize and favor their own generations. arXiv preprint arXiv:2404.13076, 2024.
[19] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.
[20] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
[21] J. Shi, Z. Yuan, Y. Liu, Y. Huang, P. Zhou, L. Sun, and N. Z. Gong. Optimization-based prompt injection attack to LLM-as-a-Judge. arXiv preprint arXiv:2403.17710, 2024.
[22] Gemma Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love, et al. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
[23] T. He, J. Zhang, T. Wang, S. Kumar, and Y. Tsvetkov. On the blind spots of model-based evaluation metrics for text generation. arXiv preprint arXiv:2212.10020, 2022.
[24] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[25] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[26] P. Wang, L. Li, L. Chen, Z. Cai, D. Zhu, B. Lin, Y. Cao, Q. Liu, T. Liu, and Z. Sui. Large Language Models are not fair evaluators. arXiv preprint arXiv:2305.17926, 2023.
[27] Y. Wang, Z. Yu, Z. Zeng, L. Yang, C. Wang, H. Chen, C. Jiang, R. Xie, J. Wang, X. Xie, et al. PandaLM: An automatic evaluation benchmark for LLM instruction tuning optimization. arXiv preprint arXiv:2306.05087, 2024.
[28] K. Wataoka, T. Takahashi, and R. Ri. Self-preference bias in LLM-as-a-Judge. arXiv preprint arXiv:2410.21819, 2024.
[29] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
[30] W. Yuan, G. Neubig, and P. Liu. BARTScore: Evaluating generated text as text generation. Advances in Neural Information Processing Systems, 34:27263–27277, 2021.
[31] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36, 2024.
[32] M. Zhong, Y. Liu, D. Yin, Y. Mao, Y. Jiao, P. Liu, C. Zhu, H. Ji, and J. Han. Towards a unified multi-dimensional evaluator for text generation. arXiv preprint arXiv:2210.07197, 2022.
[33] L. Zhu, X. Wang, and X. Wang. JudgeLM: Fine-tuned Large Language Models are scalable judges. arXiv preprint arXiv:2310.17631, 2023. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97563 | - |
| dc.description.abstract | 自我提升偏誤(self-enhancement bias)是指模型傾向於對其自身生成的回答給予過高評分的現象。然而,我們發現低品質回答往往會產生較高的偏誤分數,且這種現象在能力較低的模型中更為明顯。這反映了現有測量方法的一個重要局限性——容易受到回答品質和模型能力的干擾,導致結果不準確。
為了解決此問題,我們提出了一種新方法——SALIERI,透過配對品質相似的回答來消除這些干擾,從而量化真正的偏誤程度。實驗結果顯示,在SummEval資料集上,SALIERI將Llama 2 7B模型中高品質與低品質回答的偏誤分數差距從2.55降至0.75,並將能力較強的Llama 3 8B模型的差距從1.44降至0.69。這些結果表明,SALIERI能有效提高自我提升偏誤測量的準確性與可靠性。 | zh_TW |
| dc.description.abstract | Self-enhancement bias refers to the tendency of models to overrate their own responses. However, we found that lower-quality responses tend to produce higher bias scores, an issue exacerbated in less capable models. This highlights a key limitation of current measurement methods, which are confounded by response quality and model capability, leading to inaccurate results.
To address this, we propose SALIERI, a method that pairs responses of similar quality to isolate true bias (a minimal illustrative sketch of this pairing idea follows the metadata table below). Experiments on the SummEval dataset show that SALIERI reduced the bias score gap between high- and low-quality responses from 2.55 to 0.75 for the less capable Llama 2 7B model and from 1.44 to 0.69 for the more advanced Llama 3 8B model. These results demonstrate SALIERI's effectiveness in achieving more accurate and reliable measurement of self-enhancement bias. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-07-02T16:28:53Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-07-02T16:28:53Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Acknowledgements i
摘要 iii
Abstract v
Contents vii
List of Figures xi
List of Tables xiii
Denotation xv
Chapter 1 Introduction 1
1.1 Background 1
1.2 Motivation 3
1.3 Research Objective 4
1.4 Thesis Organization 4
Chapter 2 Related Work 7
2.1 Natural Language Generation Evaluation 7
2.2 Large Language Model (LLM) 8
2.3 Self-Enhancement Bias 9
Chapter 3 Problem Formulation 11
3.1 Self-Enhancement Bias in Pointwise Evaluation 11
3.2 Pointwise Natural Language Generation Evaluation 12
3.3 Confounding in Measuring Self-Enhancement Bias 13
3.3.1 Alternative Explanation 13
3.3.2 Evaluation Capabilities 14
Chapter 4 Response Quality Impact 19
4.1 Dataset Selection and Model Selection 19
4.1.1 Dataset Selection 19
4.1.2 Model Selection 20
4.2 Experiment Flows 21
4.3 Experiment Results 24
4.3.1 Self Response Setting 24
4.3.2 Other Response Setting 24
4.3.3 Model Capability Relationship 25
4.4 Conclusion 26
Chapter 5 SALIERI 29
5.1 Main Idea 30
5.2 Pairing Self-Enhancement Bias Score 30
5.3 Proposed Method 31
5.3.1 Overall Process 31
5.3.2 Paired Response Pool Generation 33
5.4 Experiment 33
5.4.1 Experimental Setup 33
5.4.2 Experimental Design 34
5.5 Results 34
5.5.1 Self-response setting 36
5.5.2 Other-response setting 36
5.6 Conclusion 38
Chapter 6 Discussion and Conclusion 41
6.1 Discussion 41
6.1.1 Weaker Models and Reduced Bias 41
6.1.2 Variation in Performance Across Datasets 41
6.1.3 Score Rubrics 42
6.2 Limitations 42
6.2.1 Dependency with Dataset 42
6.2.2 Pairing Response Constraint 43
6.2.3 Reference Agents Constraint 43
6.3 Conclusion 44
References 45
Appendix A — Evaluation Prompt 51
Appendix B — Treatment Check 53
Appendix C — Summeval Preparation 55
C.1 Scoring Rubric Generation 55
C.2 Diverse Response Generation 61 | - |
| dc.language.iso | en | - |
| dc.subject | 大型語言模型用評估 | zh_TW |
| dc.subject | 大型語言模型 | zh_TW |
| dc.subject | 自我提升偏誤 | zh_TW |
| dc.subject | Self-Enhancement Bias | en |
| dc.subject | LLM-as-a-judge | en |
| dc.subject | Large Language Model | en |
| dc.title | SALIERI: 透過配對相似品質的回答系統性量測大型語言模型評估者的自我增強偏誤 | zh_TW |
| dc.title | SALIERI: Systematic Assessment of Self-enhancement Bias in Large Language Model Evaluators by Pairing Similar Quality Response | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-2 | - |
| dc.description.degree | Master's | - |
| dc.contributor.coadvisor | 陳縕儂 | zh_TW |
| dc.contributor.coadvisor | Yun-Nung Chen | en |
| dc.contributor.oralexamcommittee | 李育杰;蔡宗翰;陳尚澤 | zh_TW |
| dc.contributor.oralexamcommittee | Yuh-Jye Lee;Tzong-Han Tsai;Shang-Tse Chen | en |
| dc.subject.keyword | 大型語言模型用評估,大型語言模型,自我提升偏誤 | zh_TW |
| dc.subject.keyword | LLM-as-a-judge,Large Language Model,Self-Enhancement Bias | en |
| dc.relation.page | 62 | - |
| dc.identifier.doi | 10.6342/NTU202500980 | - |
| dc.rights.note | Authorized (open access worldwide) | - |
| dc.date.accepted | 2025-06-19 | - |
| dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
| dc.contributor.author-dept | Department of Computer Science and Information Engineering | - |
| dc.date.embargo-lift | 2025-07-03 | - |
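The abstract above describes SALIERI's core idea: measuring a judge model's preference for its own responses only within pairs of similar reference quality, so that quality differences do not masquerade as bias. The Python sketch below illustrates that pairing idea under stated assumptions; the `Response` fields, the `judge` callable, and the `tol` tolerance are hypothetical names chosen for illustration and are not the thesis's actual implementation (the real procedure is defined in Chapter 5 of the thesis).

```python
# Minimal, illustrative sketch of the quality-pairing idea from the abstract.
# All names here (Response fields, the judge callable, the tolerance) are
# assumptions for illustration, not SALIERI's actual implementation.
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class Response:
    text: str
    quality: float  # reference quality, e.g. a human rating from SummEval


def paired_bias_score(
    judge: Callable[[str], float],  # judge(text) -> score from the LLM evaluator
    self_pool: list[Response],      # responses written by the judged model itself
    other_pool: list[Response],     # responses written by other models
    tol: float = 0.25,              # max reference-quality gap allowed in a pair
) -> float:
    """Mean score gap between self- and other-authored responses,
    computed only over pairs of similar reference quality.

    Restricting the comparison to matched-quality pairs removes the
    confound where low-quality responses inflate the measured bias.
    """
    gaps = []
    for s in self_pool:
        # Match each self-response to the closest-quality other-response.
        candidates = [o for o in other_pool if abs(o.quality - s.quality) <= tol]
        if not candidates:
            continue  # self-responses with no quality-matched partner are skipped
        o = min(candidates, key=lambda c: abs(c.quality - s.quality))
        gaps.append(judge(s.text) - judge(o.text))
    # Positive values indicate the judge favors its own responses.
    return mean(gaps) if gaps else 0.0
```

In this framing, one would pass the evaluator model's scoring call as `judge` and a dataset's human ratings as `quality`; a result near zero then suggests little self-enhancement bias once response quality is controlled for.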
Appears in collections: Department of Computer Science and Information Engineering
Files in this item:
| File | Size | Format | |
|---|---|---|---|
| ntu-113-2.pdf | 564.36 kB | Adobe PDF | View/Open |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.