NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98714
Title: The Energy-Performance Trade-Off of Semantic Caching for LLM Inference: An Empirical Analysis of Similarity Threshold Tuning
Authors: Kuan-Chun Chen (陳冠錞)
Advisor: Shang-Hsien Hsieh (謝尚賢)
Keywords: Semantic Vector Caching, Sustainable Artificial Intelligence, Large Language Model Inference, Cloud Native Architecture, Energy Management
Publication Year: 2025
Degree: Master
Abstract:
With the breakthrough development of Large Language Models (LLMs), their immense computational requirements have become a primary driver of growing energy consumption and carbon emissions in global data centers. Unlike traditional batch-processing tasks, interactive LLM applications are not only energy- and compute-intensive but must also meet user demands for low-latency Quality of Service (QoS), which poses significant challenges for resource scheduling in cloud-native environments. Caching mechanisms are widely adopted across the modern web, and with the rapid growth of LLM applications, semantic caching has emerged as an effective technique for reducing latency and cost. Unlike traditional caching, which typically requires an exact match, semantic caching serves a request when the embedding of an incoming query is sufficiently similar, by a distance measure, to that of a previously answered query. However, its actual impact on total system energy consumption, especially the trade-offs under specific parameter settings, has yet to receive in-depth empirical study.
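As a concrete illustration of the matching step described above, here is a minimal sketch of a threshold-gated semantic cache lookup using cosine similarity over embeddings. This is a toy Python model, not the GPTCache implementation the study actually uses; the SemanticCache class, its embedding inputs, and the default threshold value are illustrative assumptions.

    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        # Cosine similarity between two embedding vectors.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    class SemanticCache:
        # Toy semantic cache: stores (embedding, answer) pairs and returns a
        # cached answer when a new query's embedding is close enough.
        def __init__(self, threshold: float = 0.7):
            self.threshold = threshold  # the tunable knob studied in the thesis
            self.entries: list[tuple[np.ndarray, str]] = []

        def lookup(self, query_emb: np.ndarray) -> str | None:
            best_sim, best_answer = -1.0, None
            for emb, answer in self.entries:
                sim = cosine_similarity(query_emb, emb)
                if sim > best_sim:
                    best_sim, best_answer = sim, answer
            # Lenient threshold -> more hits, GPU inference skipped more often;
            # strict threshold -> more misses forwarded to the backend model.
            return best_answer if best_sim >= self.threshold else None

        def insert(self, query_emb: np.ndarray, answer: str) -> None:
            self.entries.append((query_emb, answer))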

This study aims to quantify and analyze the impact of semantic caching on the energy consumption of LLM applications in a Kubernetes environment. To emulate a real-world cloud-native environment, we built an experimental platform on GKE (Google Kubernetes Engine) consisting of an Ollama (Mistral-7B) inference service and the GPTCache vector caching mechanism, integrated with the Kepler (Kubernetes-based Efficient Power Level Exporter) and NVIDIA DCGM (Data Center GPU Manager) monitoring tools for comprehensive measurement of CPU, DRAM, and GPU power consumption. The core of the research is a systematic evaluation of how GPTCache's key parameter, the similarity threshold, non-linearly affects the system's cache hit rate, response time, and overall energy consumption.
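To make the parameter under study concrete, the sketch below initializes GPTCache with an explicit similarity threshold, loosely following GPTCache's documented Python API. The ONNX embedding model, the FAISS/SQLite storage backends, and the Ollama endpoint URL are illustrative assumptions; the thesis does not spell out its exact wiring here.

    import openai  # pre-1.0 client; GPTCache's adapter wraps this module

    from gptcache import cache, Config
    from gptcache.adapter import openai as cached_openai
    from gptcache.embedding import Onnx
    from gptcache.manager import CacheBase, VectorBase, get_data_manager
    from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

    # Assumption: Ollama serves Mistral-7B behind an OpenAI-compatible endpoint.
    openai.api_base = "http://ollama:11434/v1"
    openai.api_key = "unused"

    onnx = Onnx()  # ONNX embedding model used to vectorize incoming queries
    data_manager = get_data_manager(
        CacheBase("sqlite"),
        VectorBase("faiss", dimension=onnx.dimension),
    )

    cache.init(
        embedding_func=onnx.to_embeddings,
        data_manager=data_manager,
        similarity_evaluation=SearchDistanceEvaluation(),
        # The parameter swept in the experiments: 0.7 (lenient) to 0.95 (strict).
        config=Config(similarity_threshold=0.7),
    )

    # Requests go through the cache-aware adapter; a hit skips GPU inference.
    response = cached_openai.ChatCompletion.create(
        model="mistral",
        messages=[{"role": "user", "content": "What is semantic caching?"}],
    )

Raising similarity_threshold makes the cache accept only closer embedding matches, so more queries fall through to the GPU-backed model; this is the mechanism behind the hit-rate and energy results reported below.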

The experimental results demonstrate that the energy efficiency of semantic caching is highly conditional on its parameter configuration. When the similarity threshold was set to a lenient value (0.7), a 99.99% cache hit rate successfully circumvented energy-intensive backend GPU inference, reducing average system power from 155.88 W to 88.78 W (a decrease of about 43%), cutting total energy consumption over the experiment period from 93,256 J to 52,487 J, and shortening the average response time from 30,294 ms to approximately 105 ms. However, as the threshold became stricter, performance degraded, reaching an inflection point at a threshold of about 0.85, where the total average power (175.22 W) surpassed the baseline and the total energy consumption over the corresponding period exceeded the baseline's by 11,604 J. When the threshold was set to 0.95, the cache hit rate plummeted to 29.4%, and the average response time reached 29,035 ms, only 4% below the baseline's 30,294 ms. The system thus bore the full cost of GPU inference while also incurring the additional overhead of query vectorization, ultimately increasing total energy consumption by more than 12% to 105,130 J, confirming that an improperly configured cache can consume more energy than no cache at all.
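As a quick consistency check on these figures, dividing each run's total energy by its average power recovers the run duration; using only the numbers quoted above, every fully reported run comes out at roughly ten minutes:

    # Cross-check of figures quoted in the abstract: E = avg_power * duration,
    # so each run's duration follows from the reported totals.
    runs = {
        "baseline (no cache)": (155.88, 93_256),           # (avg W, total J)
        "threshold 0.70":      (88.78,  52_487),
        "threshold 0.85":      (175.22, 93_256 + 11_604),  # 11,604 J above baseline
        "threshold 0.95":      (None,   105_130),          # avg power not quoted
    }

    for name, (watts, joules) in runs.items():
        if watts is None:
            print(f"{name}: {joules:,} J (average power not reported)")
            continue
        seconds = joules / watts
        print(f"{name}: {joules:,} J / {watts} W = {seconds:,.0f} s (~{seconds/60:.1f} min)")

    # Every fully reported run works out to roughly 598 s (~10 min), so the
    # quoted power and energy figures are internally consistent.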

This research provides the first quantitative characterization of the energy efficiency of semantic caching in LLM applications, empirically demonstrating that appropriate configuration of cache parameters is a prerequisite for achieving energy savings and avoiding counterproductive effects. The findings offer a preliminary exploration of energy-consumption optimization for LLM applications and lay the groundwork for future research on scheduling LLM workloads on the Kubernetes platform using metrics such as semantic similarity, response time, and energy consumption.
DOI: 10.6342/NTU202504158
Full-Text License: Authorized (open access worldwide)
Electronic Full-Text Release Date: 2025-08-19
Appears in Collections: Department of Civil Engineering

Files in This Item:
File: ntu-113-2.pdf (1.24 MB, Adobe PDF)


Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
