Please use this identifier to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/7885
Title: | 透過位址預先計算以提升圖形處理器記憶體效能 Improving GPU Memory Performance via Address Pre-computation |
Authors: | Li-Jhan Chen 陳立展 |
Advisor: | Chia-Lin Yang 楊佳玲 |
Keyword: | GPGPU, Block (CTA) Scheduler, Warp (Wavefront) Scheduler, L1 Data Cache, High Performance Computing |
Publication Year: | 2016 |
Degree: | Master's |
Abstract: | High performance on GPGPUs relies on maximizing thread-level parallelism and fully utilizing resources. In the GPU execution model, thousands of threads are employed to achieve a high degree of parallelism. However, each SM (streaming multiprocessor) provides only a small L1 cache to reduce memory access latency. Massive numbers of threads competing for this small L1 cache cause poor cache performance and thereby limit system performance.
In this thesis, we examine the design of current thread block and warp schedulers. We find that the current thread block scheduler tends to distribute thread blocks that use the same cache lines across different SMs, reducing cache-line reuse opportunities. We therefore design a locality-aware block and warp scheduler to improve GPU cache performance. With our proposed software/hardware cooperative method, the cache lines touched by each block can be known a priori, so the thread block scheduler can place blocks that share cache lines on the same SM to increase reuse opportunities. In addition, the locality-aware warp scheduler controls the warp execution order at fine granularity to capture cache-line reuse. Experimental results show that our locality-aware scheduler effectively improves cache performance and achieves about 10% higher overall performance on average than state-of-the-art scheduling policies. |
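The block-scheduling idea in the abstract, placing thread blocks that share cache lines on the same SM, can be illustrated with a minimal sketch. This is not the thesis's actual algorithm; the function name, the greedy overlap heuristic, and the per-block cache-line sets are all hypothetical assumptions made for illustration, standing in for the a-priori cache-line information the proposed software/hardware method would supply.

```python
# Hypothetical sketch of locality-aware block-to-SM assignment.
# Assumption: each block's touched cache lines are known a priori
# (as the thesis's software/hardware cooperative method provides).
def locality_aware_assign(block_cachelines, num_sms):
    """block_cachelines: dict block_id -> set of cache-line addresses.
    Returns dict block_id -> sm_id."""
    assignment = {}
    sm_lines = [set() for _ in range(num_sms)]  # lines already placed on each SM
    sm_load = [0] * num_sms                     # blocks assigned per SM
    for block, lines in block_cachelines.items():
        # Greedy heuristic: pick the SM whose resident lines overlap most
        # with this block's lines; break ties by the lightest load.
        best = max(range(num_sms),
                   key=lambda s: (len(sm_lines[s] & lines), -sm_load[s]))
        assignment[block] = best
        sm_lines[best] |= lines
        sm_load[best] += 1
    return assignment

# Blocks 0/1 share line 0x100; blocks 2/3 share line 0x200.
blocks = {0: {0x100, 0x140}, 1: {0x100, 0x180},
          2: {0x200}, 3: {0x200, 0x240}}
print(locality_aware_assign(blocks, 2))  # → {0: 0, 1: 0, 2: 1, 3: 1}
```

Under this sketch, blocks sharing cache lines end up co-located on one SM, so the shared lines can hit in that SM's L1 cache instead of being fetched on both SMs.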
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/7885 |
DOI: | 10.6342/NTU201602569 |
Fulltext Rights: | Authorized (open access worldwide) |
Appears in Collections: | Department of Computer Science and Information Engineering |
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-105-1.pdf | 1.49 MB | Adobe PDF | View/Open |