Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/56052
Title: | Weighted LLC Latency-Based Run-Time Cache Partitioning for Heterogeneous CPU-GPU Architecture (基於最後一層快取記憶體加權存取延遲值之中央處理器與繪圖處理器異質性架構的快取記憶體分割機制) |
Author: | Cheng-Hsuan Li (李承軒) |
Advisor: | Chia-Lin Yang (楊佳玲) |
Keywords: | cache partitioning, off-chip latency, main-memory access latency, heterogeneous architecture, CPU, GPU |
Publication Year: | 2014 |
Degree: | Master |
Abstract: | In recent years, integrated CPU-GPU heterogeneous architectures have gradually become mainstream in commercial processors. On such integrated architectures, the shared last-level cache (LLC) comes under heavy access pressure from multiple CPU cores and the GPU, making the cache management policy a key issue for improving overall system performance. Traditional management policies aim to resolve inter-application interference in the cache, and cache partitioning is a widely used technique; partition sizes are usually allocated so as to minimize the total system cache miss rate. In a heterogeneous system, however, the GPU's high data-access rate and its tolerance for memory latency cause such allocation policies to hand cache capacity to the GPU, whose performance gains from the cache are limited, thereby missing the opportunity to maximize overall system performance. Later allocation policies designed for heterogeneous systems (such as TAP) recognized this problem and instead optimize the CPU cache hit rate whenever they detect that the cache contributes little to GPU performance.

However, because such a policy focuses only on the cache's contribution to GPU performance, it neglects the cache's role in throttling traffic to off-chip main memory, so the GPU issues frequent accesses directly to main memory. Under limited memory bandwidth, these frequent accesses congest main memory and lengthen its response time, degrading overall system performance. Since an application's execution efficiency depends not only on its cache hit rate but also on the main-memory response time, cache allocation should weigh the contributions of both factors to application performance. This thesis first proposes a method that predicts each application's main-memory access latency from the total number of last-level cache misses, so that the latency impact of a given cache allocation on each application's memory accesses can be estimated. Building on this latency information, the thesis further proposes a performance-prediction model based on cache access time, which predicts the performance impact of a cache allocation. Experimental results on 30 heterogeneous multi-programmed workloads show that the proposed method improves performance by up to 10.7% over the thread-level-parallelism-aware cache management policy (TAP), by up to 6.2% over the utility-based cache partitioning policy (UCP), and by up to 10.9% over the baseline least-recently-used (LRU) eviction policy.

Integrating the CPU and GPU on the same chip has become the development trend in microprocessor design. In an integrated CPU-GPU architecture, utilizing the shared last-level cache (LLC) is a critical design issue due to the pressure on shared resources and the different characteristics of CPU and GPU applications. Because of the latency-hiding capability of the GPU and the huge discrepancy in the number of concurrently executing threads between the CPU and GPU, LLC partitioning can no longer be achieved by simply minimizing overall cache misses, as in homogeneous CPUs. The state-of-the-art cache partitioning mechanism distinguishes cache-insensitive GPU applications from cache-sensitive ones and optimizes only the cache misses of CPU applications when the GPU is cache-insensitive. However, optimizing only the cache hit rate of CPU applications generates more cache misses from the GPU and leads to longer queuing delay in the underlying DRAM system. In terms of memory access latency, the loss due to longer queuing delay may outweigh the benefit of a higher cache hit ratio. Therefore, we find that even though the performance of a GPU application may not be sensitive to cache resources, the cache hit rate of CPU applications is not the only factor that should be considered in partitioning the LLC.

The cache miss penalty, i.e., off-chip latency, is also an important factor in designing an LLC partitioning mechanism for an integrated CPU-GPU architecture. In this thesis, we propose Weighted LLC Latency-Based Run-Time Cache Partitioning for integrated CPU-GPU architectures. To correlate a cache partition with overall performance more accurately, we develop a mechanism that predicts the off-chip latency from the total number of cache misses, together with a GPU cache-sensitivity monitor that quantitatively profiles the GPU's performance sensitivity to memory access latency. The experimental results show that the proposed mechanism improves overall throughput by 9.7% over TLP-aware cache partitioning (TAP), 6.2% over Utility-based Cache Partitioning (UCP), and 10.9% over LRU on 30 heterogeneous workloads. |
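The core idea, picking a way split that minimizes predicted memory access latency rather than raw miss count, can be illustrated with a small sketch. Everything below is an illustrative assumption, not the thesis's actual implementation: the linear queuing model (off-chip latency grows with total miss traffic), the fixed latency constants, and the per-way miss curves supplied by the caller.

```python
# Sketch of latency-weighted LLC partitioning between a CPU and a GPU.
# All constants and the queuing model are assumptions for illustration only.

BASE_DRAM_LATENCY = 200    # cycles: uncontended off-chip latency (assumed)
QUEUE_FACTOR = 0.002       # extra cycles per total miss, modeling congestion (assumed)
HIT_LATENCY = 30           # cycles: LLC hit latency (assumed)
TOTAL_WAYS = 16            # ways in the shared LLC (assumed)

def predict_off_chip_latency(total_misses):
    """Off-chip latency rises with combined CPU+GPU miss traffic (queuing delay)."""
    return BASE_DRAM_LATENCY + QUEUE_FACTOR * total_misses

def avg_access_latency(hits, misses, off_chip_latency):
    """Average memory access latency seen by one requester."""
    accesses = hits + misses
    if accesses == 0:
        return 0.0
    return (hits * HIT_LATENCY + misses * off_chip_latency) / accesses

def choose_partition(cpu_miss_curve, gpu_miss_curve, cpu_accesses, gpu_accesses,
                     gpu_latency_sensitivity):
    """Pick the way split minimizing the weighted sum of average latencies.

    cpu_miss_curve / gpu_miss_curve: misses as a function of allocated ways
    (indexable from 0 to TOTAL_WAYS), as a profiling monitor might report.
    gpu_latency_sensitivity: 0 = fully latency-tolerant GPU, 1 = fully sensitive.
    """
    best = None
    for cpu_ways in range(1, TOTAL_WAYS):
        gpu_ways = TOTAL_WAYS - cpu_ways
        cpu_misses = cpu_miss_curve[cpu_ways]
        gpu_misses = gpu_miss_curve[gpu_ways]
        # Both requesters see the same congested off-chip latency.
        off_chip = predict_off_chip_latency(cpu_misses + gpu_misses)
        cpu_lat = avg_access_latency(cpu_accesses - cpu_misses, cpu_misses, off_chip)
        gpu_lat = avg_access_latency(gpu_accesses - gpu_misses, gpu_misses, off_chip)
        # Weight the GPU's latency by how much latency actually hurts it.
        score = cpu_lat + gpu_latency_sensitivity * gpu_lat
        if best is None or score < best[0]:
            best = (score, cpu_ways, gpu_ways)
    return best[1], best[2]
```

Note how the shared `off_chip` term couples the two requesters: even a latency-tolerant GPU (sensitivity near 0) still influences the decision, because its misses congest DRAM and inflate the CPU's miss penalty, which is the effect the abstract argues TAP-style hit-rate-only policies overlook.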
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/56052 |
Full-Text License: | Authorized with compensation |
Appears in Collections: | Department of Computer Science and Information Engineering |
Files in This Item:
File | Size | Format |
---|---|---|
ntu-103-1.pdf (currently not authorized for public access) | 1.59 MB | Adobe PDF |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.