Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98714

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 謝尚賢 | zh_TW |
| dc.contributor.advisor | Shang-Hsien Hsieh | en |
| dc.contributor.author | 陳冠錞 | zh_TW |
| dc.contributor.author | Kuan-Chun Chen | en |
| dc.date.accessioned | 2025-08-18T16:12:24Z | - |
| dc.date.available | 2025-08-19 | - |
| dc.date.copyright | 2025-08-18 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-08-07 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98714 | - |
| dc.description.abstract | 隨著大型語言模型(LLM)的突破性發展,其龐大的運算需求已成為全球資料中心能源消耗與碳排放增長的主要驅動力。與傳統的批次處理任務不同,互動式 LLM 應用程式不僅是能源、計算密集型任務,同時也需滿足使用者對低延遲服務品質(QoS)的要求,這對於雲端原生環境下的資源調度帶來一定程度的挑戰。快取機制(Cache Mechanism)在現今的網路世界被大量採用,而在大型語言模型應用程式數量快速增長下,語意快取(Semantic Caching)的出現也被視為一種降低延遲與成本的有效技術;不同於傳統快取機制多半要求完全符合(Exact Match),語意快取透過詞向量(Embeddings)與相似度(Similarity)的距離計算達成;然而,其對於系統總體能耗的真實影響,尤其是在特定參數設定下的權衡關係,目前學術界尚缺乏深入的實證研究。
本研究目的在於量化並分析語意快取對 Kubernetes 環境下 LLM 應用程式能耗的影響。本研究在 GKE(Google Kubernetes Engine)上搭建了一個由 Ollama(Mistral-7B)推論服務、GPTCache 向量快取機制構成的實驗平台,以模擬真實世界的雲端原生環境,並整合 Kepler(Kubernetes-based Efficient Power Level Exporter)與 NVIDIA DCGM(Data Center GPU Manager)監控工具,以實現對 CPU、DRAM 及 GPU 功耗的全面性量測。研究核心在於系統性地評估 GPTCache 的相似度閾值(Similarity Threshold)此關鍵參數,對於系統的快取命中率、回應時間與總體能耗所造成的非線性影響。 實驗結果顯示,語意快取的節能效益與其參數設定高度相關。當相似度閾值設定在一個較寬鬆的區間(0.7),高達 99.99% 的快取命中率能成功規避後端高耗能的 GPU 推論,使系統平均功率從 155.877 W 降低至 88.78 W(下降約 43%),換算在整個實驗週期的總能耗更是從 93,256 焦耳大幅減少至 52,487 焦耳,並將平均回應時間從 30,294 毫秒大幅縮短至約 105 毫秒。然而,隨著閾值變得嚴苛,系統效能開始下降,並在閾值約 0.85 時出現轉折,此時總平均功率(175.22 W)已超過基準組,對應的實驗週期內的總能耗也比基準組多出 11,604 焦耳。當閾值設定為 0.95 時,快取命中率驟降至 29.4%,並且系統平均回應時間達到 29,035 毫秒,僅比基準組的 30,294 毫秒低 4%,導致系統不僅要承擔 GPU 推論的完整成本,還付出因查詢向量化所產生的額外開銷,最終使總能耗增加超過 12%,達到 105,130焦耳,證實了不當的快取設定反而更耗能。 本研究首次量化了語意快取在 LLM 應用中能源效益的特性,實證對快取參數的適當設定是實現節能目標、避免負面效果的前提之一。本研究成果不僅為 LLM 應用的能耗優化提供初步的探索研究,也為後續在 Kubernetes 平台上進行透過語意相似度、回應時間、能耗等指標調度大型語言模型工作負載的研究奠定基礎。 | zh_TW |
| dc.description.abstract | With the breakthrough development of Large Language Models (LLMs), their immense computational requirements have become a primary driver of the growth in energy consumption and carbon emissions in global data centers. Unlike traditional batch-processing tasks, interactive LLM applications are not only energy- and compute-intensive but must also meet user demands for low-latency Quality of Service (QoS), which poses significant challenges for resource scheduling in cloud-native environments. Caching mechanisms are widely adopted on the modern web, and with the rapid growth of LLM applications, semantic caching has emerged as an effective technique for reducing latency and cost. Unlike traditional caching, which generally requires an exact match, a semantic cache matches queries by computing similarity distances between their embeddings. However, its actual impact on total system energy consumption, and in particular the trade-offs introduced by specific parameter settings, has so far received little in-depth empirical study.
This study aims to quantify and analyze the impact of semantic caching on the energy consumption of LLM applications in a Kubernetes environment. We constructed an experimental platform on Google Kubernetes Engine (GKE) consisting of an Ollama (Mistral-7B) inference service and the GPTCache vector cache, integrated with the Kepler (Kubernetes-based Efficient Power Level Exporter) and NVIDIA DCGM (Data Center GPU Manager) monitoring tools to obtain comprehensive measurements of CPU, DRAM, and GPU power consumption. The core of the research is a systematic evaluation of how a key GPTCache parameter, the similarity threshold, non-linearly affects the system's cache hit rate, response time, and overall energy consumption. The experimental results show that the energy savings of semantic caching are highly conditional on its parameter configuration. When the similarity threshold was set to a lenient value (0.7), a cache hit rate of 99.99% successfully avoided the energy-intensive GPU inference on the back end, reducing average system power from 155.88 W to 88.78 W (a decrease of about 43%), cutting total energy consumption over the experiment from 93,256 J to 52,487 J, and shortening the average response time from 30,294 ms to roughly 105 ms. However, as the threshold became stricter, performance degraded, with an inflection point at a threshold of about 0.85, where the total average power (175.22 W) exceeded the baseline and the total energy consumption over the corresponding experimental period was 11,604 J higher than that of the baseline group. When the threshold was set to 0.95, the cache hit rate plummeted to 29.4%, and the average response time reached 29,035 ms, only 4% lower than the baseline's 30,294 ms. The system therefore not only bore the full cost of GPU inference but also incurred the additional overhead of query vectorization, ultimately increasing total energy consumption by more than 12% to 105,130 J and confirming that an improperly configured cache can consume more energy than no cache at all. This research provides the first quantitative characterization of the energy-efficiency behavior of semantic caching in LLM applications and empirically demonstrates that appropriate configuration of cache parameters is a prerequisite for achieving energy savings and avoiding adverse effects. The findings not only offer a preliminary exploratory investigation into energy-consumption optimization for LLM applications but also lay the groundwork for future research on scheduling LLM workloads on Kubernetes using metrics such as semantic similarity, response time, and energy consumption. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-08-18T16:12:24Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-08-18T16:12:24Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | 致謝: i
摘要: iii Abstract: v Contents: viii List of Figures: x List of Tables: xi Chapter 1 Introduction: 1 1.1 Motivation: 1 1.2 Research Objectives: 3 1.3 Organization of Thesis: 4 Chapter 2 Literature Review: 6 2.1 Green Software: 6 2.2 Cloud Native Deployment: 7 2.3 Carbon Awareness Large Language Model Inference: 9 2.3.1 Direct Carbon Reduction Strategies: 9 2.3.2 Semantic Caching for Cost and Carbon Reduction: 10 2.4 Observability: 14 Chapter 3 Methodology: 16 3.1 Statement of Problem: 16 3.2 System and Experiment Design: 18 3.2.1 System Architecture: 18 3.2.2 Dataset and Workload Generation: 21 3.2.3 Energy Measurement Framework: 22 3.3 Experiment Variable: 22 3.4 Experiments and Measurement: 23 3.4.1 Procedure: 23 3.4.2 Data Collection and Measurement: 24 Chapter 4 Results and Discussion: 25 4.1 Results: 25 4.2 Discussion: 27 4.2.1 Analysis at Low Similarity Thresholds (0.7, 0.75, 0.8): 27 4.2.2 Analysis at an Intermediate Threshold (0.85, 0.9): 28 4.2.3 Analysis at a Strict Threshold (0.95): 29 4.3 Summary: 30 Chapter 5 Conclusion: 32 5.1 Key Findings: 32 5.2 Research Contribution: 33 5.3 Future Work: 34 References: 37 | - |
| dc.language.iso | en | - |
| dc.subject | 語意向量快取 | zh_TW |
| dc.subject | 永續人工智慧 | zh_TW |
| dc.subject | 大型語言模型推論 | zh_TW |
| dc.subject | 雲端原生架構 | zh_TW |
| dc.subject | 能源管理 | zh_TW |
| dc.subject | Sustainable Artificial Intelligence | en |
| dc.subject | Semantic Vector Caching | en |
| dc.subject | Energy Management | en |
| dc.subject | Cloud Native Architecture | en |
| dc.subject | Large Language Model Inference | en |
| dc.title | 大型語言模型推論中語意快取的能耗與效能權衡:相似度閾值調整的實證分析 | zh_TW |
| dc.title | The Energy-Performance Trade-Off of Semantic Caching for LLM Inference: An Empirical Analysis of Similarity Threshold Tuning | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-2 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.oralexamcommittee | 謝依芸;張慰慈 | zh_TW |
| dc.contributor.oralexamcommittee | I-Yun Hsieh;Wei-Tze Chang | en |
| dc.subject.keyword | 語意向量快取,永續人工智慧,大型語言模型推論,雲端原生架構,能源管理 | zh_TW |
| dc.subject.keyword | Semantic Vector Caching, Sustainable Artificial Intelligence, Large Language Model Inference, Cloud Native Architecture, Energy Management | en |
| dc.relation.page | 41 | - |
| dc.identifier.doi | 10.6342/NTU202504158 | - |
| dc.rights.note | 同意授權(全球公開) | - |
| dc.date.accepted | 2025-08-13 | - |
| dc.contributor.author-college | 工學院 | - |
| dc.contributor.author-dept | 土木工程學系 | - |
| dc.date.embargo-lift | 2025-08-19 | - |
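
The mechanism evaluated in the abstract above serves a request from the cache only when the embedding of the incoming query is similar enough to that of a previously cached query. The following Python snippet is a minimal, illustrative sketch of that decision logic only; it is not the thesis's GPTCache/Ollama code, and the brute-force vector search and the injected `embed`/`llm` callables are placeholder assumptions.

```python
# Illustrative sketch of a threshold-gated semantic cache (not the thesis's implementation).
import math
from typing import Callable, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Return a cached answer when the best-matching stored query clears the threshold."""

    def __init__(self, embed: Callable[[str], List[float]],
                 llm: Callable[[str], str], threshold: float = 0.7):
        self.embed = embed          # query -> embedding vector (placeholder)
        self.llm = llm              # fallback inference call (expensive, GPU-bound)
        self.threshold = threshold  # similarity threshold (0.7 .. 0.95 in the study)
        self.store: List[Tuple[List[float], str]] = []  # (embedding, cached answer)

    def query(self, prompt: str) -> Tuple[str, bool]:
        vec = self.embed(prompt)    # vectorization overhead is paid on every request
        best_sim, best_answer = 0.0, None
        for cached_vec, answer in self.store:
            sim = cosine_similarity(vec, cached_vec)
            if sim > best_sim:
                best_sim, best_answer = sim, answer
        if best_answer is not None and best_sim >= self.threshold:
            return best_answer, True       # cache hit: GPU inference avoided
        answer = self.llm(prompt)          # cache miss: full inference cost incurred
        self.store.append((vec, answer))
        return answer, False
```

In GPTCache itself the equivalent decision is governed by its similarity-threshold configuration and a dedicated vector store, but the hit-or-miss behavior, and hence the energy trade-off reported above, follows the same pattern. As a quick consistency check using only the figures reported in the abstract, the headline percentages can be recomputed directly:

```latex
\begin{align*}
\frac{155.877 - 88.78}{155.877} &\approx 0.43  && \text{power reduction of about 43\% at threshold } 0.7,\\
\frac{105{,}130 - 93{,}256}{93{,}256} &\approx 0.127 && \text{energy increase of more than 12\% at threshold } 0.95,\\
\frac{30{,}294 - 29{,}035}{30{,}294} &\approx 0.042 && \text{response-time reduction of about 4\% at threshold } 0.95.
\end{align*}
```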
Appears in Collections: 土木工程學系
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-113-2.pdf | 1.24 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
