Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101156

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 黃明蕙 | zh_TW |
| dc.contributor.advisor | Ming-Hui Huang | en |
| dc.contributor.author | 王彥碩 | zh_TW |
| dc.contributor.author | Yan-Shuo Wang | en |
| dc.date.accessioned | 2025-12-31T16:08:50Z | - |
| dc.date.available | 2026-01-01 | - |
| dc.date.copyright | 2025-12-31 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-12-18 | - |
| dc.identifier.citation | Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., & Hesse, C. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877-1901. https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
Chen, Y., & Prentice, C. (2024). Integrating artificial intelligence and customer experience. Australasian Marketing Journal. https://doi.org/10.1177/14413582241252904
Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30, 4299-4307.
Fu, D., He, K., Wang, Y., Hong, W., Gongque, Z., Zeng, W., Wang, W., Wang, J., Cai, X., & Xu, W. (2025). AgentRefine: Enhancing agent generalization through refinement tuning. arXiv preprint arXiv:2501.01702.
Furniturewala, S., Jandial, S., Java, A., Banerjee, P., Shahid, S., Bhatia, S., & Jaidka, K. (2024). "Thinking" fair and slow: On the efficacy of structured prompts for debiasing language models. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024), 213-227. https://doi.org/10.18653/v1/2024.emnlp-main.13
Ho, D. H., & Fan, C. (2025). Self-critique-guided curiosity refinement: Enhancing honesty and helpfulness in large language models via in-context learning. arXiv preprint arXiv:2506.16064.
Huang, A., Block, A., Foster, D. J., Rohatgi, D., Zhang, C., Simchowitz, M., Ash, J. T., & Krishnamurthy, A. (2025). Self-improvement in language models: The sharpening mechanism. The Thirteenth International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=WJaUkwci9o
Huang, M.-H., & Rust, R. T. (2018). Artificial intelligence in service. Journal of Service Research, 21(2), 155-172.
Huang, M.-H., & Rust, R. T. (2024a). Automating creativity. https://arxiv.org/abs/2405.06915
Huang, M.-H., & Rust, R. T. (2024b). The caring machine: Feeling AI for customer care. Journal of Marketing, 88(5), 1-23. https://doi.org/10.1177/00222429231224748
Jiang, D., Zhang, J., Weller, O., Weir, N., Van Durme, B., & Khashabi, D. (2025). Self-[In]Correct: LLMs struggle with discriminating self-generated responses. Proceedings of the AAAI Conference on Artificial Intelligence.
Kong, A., Zhao, S., Chen, H., Li, Q., Qin, Y., Sun, R., Zhou, X., Wang, E., & Dong, X. (2024). Better zero-shot reasoning with role-play prompting. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024), 4099-4113. https://doi.org/10.18653/v1/2024.naacl-long.228
Mendonça, J., Lavie, A., & Trancoso, I. (2024, August). On the benchmarking of LLMs for open-domain dialogue evaluation. The 6th Workshop on NLP for Conversational AI (NLP4ConvAI 2024), Bangkok, Thailand. https://aclanthology.org/2024.nlp4convai-1.1/
Mizrahi, M., Kaplan, G., Malkin, D., Dror, R., Shahaf, D., & Stanovsky, G. (2024). State of what art? A call for multi-prompt LLM evaluation. Transactions of the Association for Computational Linguistics, 12, 933-949. https://doi.org/10.1162/tacl_a_00681
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., & Lowe, R. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730-27744.
Pan, L., Saxon, M., Xu, W., Nathani, D., Wang, X., & Wang, W. Y. (2024). Automatically correcting large language models: Surveying the landscape of diverse automated correction strategies. Transactions of the Association for Computational Linguistics, 12, 484-506. https://doi.org/10.1162/tacl_a_00660
Shawar, B. A., & Atwell, E. (2007). Chatbots: Are they really useful? Journal for Language Technology and Computational Linguistics, 22(1), 29-49.
Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D., & Christiano, P. F. (2020). Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33, 3008-3021.
Tueanrat, Y., Papagiannidis, S., & Alamanos, E. (2021). Going on a journey: A review of the customer journey literature. Journal of Business Research, 125, 336-353. https://doi.org/10.1016/j.jbusres.2020.12.028
Vinyals, O., & Le, Q. (2015). A neural conversational model. arXiv preprint arXiv:1506.05869.
Wei, J., Kim, S., Jung, H., & Kim, Y.-H. (2024). Leveraging large language models to power chatbots for collecting user self-reported data. Proceedings of the ACM on Human-Computer Interaction, 8(CSCW1), Article 87. https://doi.org/10.1145/3637364
Wulf, J., & Meierhofer, J. (2024a). Exploring the potential of large language models for automation in technical customer service. arXiv preprint arXiv:2405.09161.
Wulf, J., & Meierhofer, J. (2024b). Utilizing large language models for automating technical customer support. arXiv preprint arXiv:2406.01407. https://doi.org/10.48550/arXiv.2406.01407
Xu, Y., Shieh, C.-H., & van Esch, P. (2020). AI customer service: Task complexity, problem-solving ability, and usage intention. Australasian Marketing Journal. https://doi.org/10.1016/j.ausmj.2020.03.005
Zhang, Z., Peng, L., Pang, T., Han, J., Zhao, H., & Schuller, B. W. (2023). Refashioning emotion recognition modeling: The advent of generalized large models. arXiv preprint arXiv:2308.11578. https://arxiv.org/abs/2308.11578
Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., & Stoica, I. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems (Datasets and Benchmarks Track). https://proceedings.neurips.cc/paper_files/paper/2023/hash/91f18a1287b398d378ef22505bf41832-Abstract-Datasets_and_Benchmarks.html | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101156 | - |
| dc.description.abstract | 近年來,大型語言模型 (LLMs) 已被大量整合至客服領域。然而,LLM 在應用於現實客服場景時仍面臨許多問題,包括回應品質不一致、情感表達平淡,以及難以理解多階段互動中的複雜語境。本研究提出一個推論時 (inference-time)、無需微調的獎勵工程框架,並將其應用於四階段客戶服務旅程,以模擬現實世界的實務問題。我們透過不同的實驗條件(基線、閾值、關鍵評估者與外部評估者)處理了 7 則富含情感的客戶抱怨推文,研究發現:(1) 實務的客戶服務情境需要建立穩健的品質底線 (quality floor);(2) 關鍵評估者的效率最低,消耗最多的 Token,且平均迭代次數最高(3.07 次,相較於其他條件約 1.5 次);(3) 雖然外部評估者表現最佳(成功率 98.95%),但在預算有限的情況下,閾值條件 (Threshold condition) 是最推薦的選擇。本研究提供了一個探索性研究的範例,旨在測試獎勵工程方法是否能提升回應效率,並發現不同的基礎模型在給定相同的評估標準 (rubric) 下,實際上能彼此達成共識。結果顯示,當客戶服務回應必須達到高標準時,迭代優化機制是不可或缺的。此外,在應用此框架時,「以 LLM 為評審 (LLM-as-a-Judge)」扮演了穩健的評估角色。 | zh_TW |
| dc.description.abstract | In recent years, Large Language Models (LLMs) have been widely integrated into the customer service field. However, LLMs still face obstacles when applied to real-world scenarios, including inconsistent response quality, emotional flatness, and a failure to understand the complex context of multi-stage interactions. This study proposes an inference-time, fine-tuning-free reward engineering framework applied to the four-stage customer care journey to simulate real-world practical problems. We processed 7 emotionally rich complaint tweets under four experimental conditions (Baseline, Threshold, Critical Evaluator, and External Evaluator) and found that: (1) practical customer service scenarios require a robust quality floor; (2) the Critical Evaluator was the least efficient, consuming the most tokens and averaging the most iterations (3.07, versus roughly 1.5 for the other conditions); and (3) while the External Evaluator achieved the best performance (a 98.95% success rate), the Threshold condition is the most recommended under a tight budget. This exploratory study tests whether the reward engineering approach improves response efficiency, and finds that different base models can in fact reach consensus with one another when given the same evaluation rubric. The results show that an iterative refinement mechanism is essential when customer care responses must meet a high standard, and that LLM-as-a-Judge serves as a robust evaluator within this framework. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-12-31T16:08:50Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-12-31T16:08:50Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | 誌謝 i
摘要 ii
ABSTRACT iii
CONTENTS iv
LIST OF FIGURES vii
LIST OF TABLES viii
Chapter 1 Introduction 1
1.1 Background and Motivation 1
1.2 Research Questions 3
1.3 Research Objectives 3
1.4 Research Hypothesis 4
1.5 Thesis Organization 5
Chapter 2 Literature Review 7
2.1 Introduction to AI in Customer Services 7
2.1.1 The Multi-Stage Customer Care Journey 9
2.2 Challenges in Real-World Customer Service 11
2.3 Structured Approaches to In-Context Control 12
2.3.1 Structured Prompt Design and Role-Play Prompting 13
2.3.2 Iterative Prompt Design for Refinement 14
2.4 LLM Optimization: From Self-Refinement to Collaborative AI Agent Framework 14
2.4.1 Self-Refinement: Rise and Limitations 15
2.4.2 Collaborative Correction in AI Agent Workflows 16
2.4.3 Evaluation via LLM-as-a-Judge 17
2.5 Summary and Conceptual Framework 18
Chapter 3 Methodology 20
3.1 Overview of Research Framework 20
3.2 Dataset Design and Message Control 22
3.3 Models and Tool Selection 24
3.4 Experimental Design and Agent Framework 25
3.5 Evaluation Protocol and Metrics 26
3.6 Experimental Setup 27
Chapter 4 Experiment and Results 28
4.1 Quantitative Analysis 28
4.1.1 Overall Performance and Efficiency 28
4.1.2 Stage-Wise Performance and Challenge Identification 30
4.1.3 Stage-Wise Average Iterations 31
4.2 Qualitative Analysis: A Case Study in Refinement 32
4.3 Summary of Key Findings 34
Chapter 5 Discussion and Conclusion 36
5.1 Interpretation of Findings 36
5.2 Theoretical Implications 37
5.3 Practical Implications 38
5.4 Research Limitations 38
5.5 Future Research and Conclusion 39
REFERENCES 41
Appendix A Prompt Templates & Evaluation Rubrics 43
A.1 Selected Customer Tweets for Experiments 43
A.2 Core Agent Prompts 43
A.2.1 System Prompt 43
A.2.2 Stage Generation Prompts 44
A.3 Evaluation Prompts 44
A.3.1 Standard Evaluation Prompts 44
A.3.2 Critical Evaluation Prompts 48
A.3.3 External Critique Prompts 49
A.4 Evaluation Rubrics 52 | - |
| dc.language.iso | en | - |
| dc.subject | 客戶服務旅程 | - |
| dc.subject | 大型語言模型 | - |
| dc.subject | 上下文學習 | - |
| dc.subject | 獎勵工程 | - |
| dc.subject | LLM-as-a-Judge | - |
| dc.subject | Customer Care Journey | - |
| dc.subject | Large Language Model (LLM) | - |
| dc.subject | In-Context Learning | - |
| dc.subject | Reward Engineering | - |
| dc.subject | LLM-as-a-Judge | - |
| dc.title | 運用獎勵工程提升顧客服務旅程體驗 | zh_TW |
| dc.title | Using Reward Engineering to Enhance Customer Care Journey | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 114-1 | - |
| dc.description.degree | Master | - |
| dc.contributor.oralexamcommittee | 何承遠;張瑋倫 | zh_TW |
| dc.contributor.oralexamcommittee | Cheng-Yuan Ho;Wei-Lun Chang | en |
| dc.subject.keyword | 客戶服務旅程,大型語言模型,上下文學習,獎勵工程,LLM-as-a-Judge | zh_TW |
| dc.subject.keyword | Customer Care Journey,Large Language Model (LLM),In-Context Learning,Reward Engineering,LLM-as-a-Judge | en |
| dc.relation.page | 53 | - |
| dc.identifier.doi | 10.6342/NTU202504806 | - |
| dc.rights.note | Not authorized | - |
| dc.date.accepted | 2025-12-18 | - |
| dc.contributor.author-college | College of Management | - |
| dc.contributor.author-dept | Department of Information Management | - |
| dc.date.embargo-lift | N/A | - |
| Appears in Collections: | Department of Information Management | |
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-114-1.pdf (not authorized for public access) | 1.1 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
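
For readers of the abstract above, the following is a minimal, hypothetical sketch of the kind of threshold-gated refinement loop it describes: a generator drafts a stage-specific reply, an LLM-as-a-Judge scores it against a rubric, and the draft is regenerated with the judge's feedback until the score clears a quality floor or an iteration budget is exhausted. All names, stage labels, the 8.0 floor, the 5-iteration budget, and the stubbed model calls are illustrative assumptions; the thesis's actual prompts and rubrics live in its Appendix A and are not reproduced here.

```python
# Hypothetical sketch of a threshold-gated generate -> judge -> refine loop
# over a four-stage customer care journey. Stage names, the quality floor,
# and the iteration budget are assumptions, not the thesis's settings.
from dataclasses import dataclass

STAGES = ["pre-purchase", "purchase", "post-purchase", "re-engagement"]  # assumed labels
QUALITY_FLOOR = 8.0   # assumed rubric threshold on a 0-10 scale
MAX_ITERATIONS = 5    # assumed per-stage refinement budget

@dataclass
class Attempt:
    stage: str
    draft: str
    score: float
    iterations: int

def generate_reply(tweet: str, stage: str, feedback: str | None) -> str:
    """Placeholder for the generator LLM call; a real run would prompt a
    base model with a stage-specific template plus any judge feedback."""
    suffix = f" [revised per: {feedback}]" if feedback else ""
    return f"Reply to '{tweet}' at the {stage} stage{suffix}"

def judge_reply(draft: str) -> tuple[float, str]:
    """Placeholder for the LLM-as-a-Judge call; a real run would score the
    draft against the evaluation rubric and return critique text."""
    return 8.5, "tone is empathetic; add a concrete next step"

def refine_until_floor(tweet: str, stage: str) -> Attempt:
    """Regenerate the reply until the judge's score clears the quality
    floor, or return the last draft once the budget is spent."""
    feedback: str | None = None
    for iteration in range(1, MAX_ITERATIONS + 1):
        draft = generate_reply(tweet, stage, feedback)
        score, feedback = judge_reply(draft)
        if score >= QUALITY_FLOOR:  # quality floor met: stop early
            return Attempt(stage, draft, score, iteration)
    return Attempt(stage, draft, score, MAX_ITERATIONS)  # budget exhausted

if __name__ == "__main__":
    tweet = "My order arrived broken and support keeps ignoring me!"
    for stage in STAGES:
        result = refine_until_floor(tweet, stage)
        print(f"{result.stage}: score={result.score} after {result.iterations} iteration(s)")
```

The early exit on clearing the floor is what makes a Threshold-style condition cheap: extra iterations (and tokens) are spent only on drafts that fail the rubric, which is consistent with the abstract's recommendation of the Threshold condition under a tight budget.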
