NTU Theses and Dissertations Repository › College of Electrical Engineering and Computer Science › Department of Computer Science and Information Engineering
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/76451

Full metadata record (DC field: value [language])
dc.contributor.advisor: 楊佳玲
dc.contributor.author: Li-Jhan Chen [en]
dc.contributor.author: 陳立展 [zh_TW]
dc.date.accessioned: 2021-07-09T15:52:34Z
dc.date.available: 2021-11-02
dc.date.copyright: 2016-11-02
dc.date.issued: 2016
dc.date.submitted: 2016-08-15
dc.identifier.citation: [1] K. M. Abdalla et al. Scheduling and execution of compute tasks. US Patent US20130185725, 2013.
[2] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. In Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on, pages 163–174, April 2009.
[3] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. H. Lee, and K. Skadron. Rodinia: A benchmark suite for heterogeneous computing. In Workload Characterization, 2009. IISWC 2009. IEEE International Symposium on, pages 44–54, Oct 2009.
[4] M. Gebhart, D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally, E. Lindholm, and K. Skadron. Energy-efficient mechanisms for managing thread context in throughput processors. In Proceedings of the 38th Annual International Symposium on Computer Architecture, ISCA ’11, pages 235–246, New York, NY, USA, 2011. ACM.
[5] W. Jia, K. A. Shaw, and M. Martonosi. Characterizing and improving the use of demand-fetched caches in GPUs. In Proceedings of the 26th ACM International Conference on Supercomputing, ICS ’12, pages 15–24, New York, NY, USA, 2012. ACM.
[6] M. Lee, S. Song, J. Moon, J. Kim, W. Seo, Y. Cho, and S. Ryu. Improving GPGPU resource utilization through alternative thread block scheduling. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), pages 260–271, Feb 2014.
[7] D. Li, M. Rhu, D. R. Johnson, M. O’Connor, M. Erez, D. Burger, D. S. Fussell, and S. W. Keckler. Priority-based cache allocation in throughput processors. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pages 89–100, Feb 2015.
[8] V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt. Improving GPU performance via large warps and two-level warp scheduling. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44, pages 308–317, New York, NY, USA, 2011. ACM.
[9] J. Nickolls and W. J. Dally. The GPU computing era. IEEE Micro, 30(2):56–69, March 2010.
[10] NVIDIA. NVIDIA’s Next Generation CUDA Compute Architecture: Fermi. http://www.nvidia.com/content/pdf/fermi_white_papers/nvidia_fermi_compute_architecture_whitepaper.pdf, 2009.
[11] NVIDIA. CUDA C/C++ SDK code samples, 2011.
[12] NVIDIA. Kepler GK110 whitepaper. http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf, 2012.
[13] NVIDIA Corporation. NVIDIA CUDA Compute Unified Device Architecture Programming Guide. NVIDIA Corporation, 2007.
[14] M. S. Orr, B. M. Beckmann, S. K. Reinhardt, and D. A. Wood. Fine-grain task aggregation and coordination on GPUs. In Proceedings of the 41st Annual International Symposium on Computer Architecture, ISCA ’14, pages 181–192, Piscataway, NJ, USA, 2014. IEEE Press.
[15] T. G. Rogers, M. O’Connor, and T. M. Aamodt. Cache-conscious wavefront scheduling. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-45, pages 72–83, Washington, DC, USA, 2012. IEEE Computer Society.
[16] T. G. Rogers, M. O’Connor, and T. M. Aamodt. Divergence-aware warp scheduling. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-46, pages 99–110, New York, NY, USA, 2013. ACM.
[17] J. E. Stone, D. Gohara, and G. Shi. OpenCL: A parallel programming standard for heterogeneous computing systems. Computing in Science & Engineering, 12(3):66–73, May 2010.
[18] X. Xie, Y. Liang, Y. Wang, G. Sun, and T. Wang. Coordinated static and dynamic cache bypassing for GPUs. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pages 76–88, Feb 2015.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/76451
dc.description.abstract: (Translated from Chinese.) The high performance of general-purpose GPUs comes mainly from massive thread-level parallelism and full resource utilization. Under the GPGPU execution model, thousands of threads are used to achieve a high degree of parallelism, yet each core has only a small L1 cache to reduce memory access latency. With so many threads competing for a small L1 cache, cache performance suffers and overall system performance is limited.
In this thesis, we examine the thread block scheduler and warp scheduler designs of current GPUs. We find that existing block schedulers tend to spread thread blocks that touch the same cache lines across different cores, reducing opportunities for cache-line reuse. We therefore design locality-aware block and warp schedulers to improve cache performance. Through software/hardware cooperation, the block scheduler learns in advance which cache lines each block will touch and places blocks that share cache lines on the same core, increasing reuse opportunities. In addition, the locality-aware warp scheduler controls warp execution order at fine granularity to capture those reuse opportunities. Experimental results show that the proposed locality-aware scheduler effectively improves cache performance and delivers roughly 10% higher overall performance than state-of-the-art scheduler designs. [zh_TW]
dc.description.abstract: High-performance computing on GPGPUs relies on maximizing thread-level parallelism and fully utilizing resources. In the GPU execution model, thousands of threads are employed to achieve a high degree of parallelism. However, only a small L1 cache is provided in each SM (streaming multiprocessor) to reduce memory access latency. Massive numbers of threads competing for a small L1 cache cause poor cache performance and limit system performance.
In this thesis, we explore the current design of thread block and warp schedulers. We find that the current thread block scheduler tends to allocate thread blocks that use the same cache lines to different SMs, reducing cache reuse opportunities. We therefore design a Locality-Aware Scheduler to improve GPU cache performance. With our proposed software and hardware cooperative method, the cache lines touched by a block can be known a priori, so the thread block scheduler can place blocks that share cache lines on the same SM to increase cache-line reuse. In addition, the locality-aware warp scheduler controls warp execution order at fine granularity to capture cache locality. The results show that our Locality-Aware Scheduler effectively improves cache performance and achieves about 10% higher performance on average than state-of-the-art scheduling policies. [en]
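The block-dispatch idea in the abstract can be sketched in a few lines. This is a minimal, purely illustrative model, not the thesis's actual algorithm: it assumes each thread block's cache-line access range has already been precomputed (the software/hardware cooperation described above), and it uses a simple greedy policy (hypothetical names throughout) to co-locate blocks with overlapping ranges on the same SM.

```python
# Hypothetical sketch of locality-aware block dispatching: blocks whose
# precomputed cache-line ranges overlap are sent to the same SM so the
# shared lines can be reused in that SM's L1 cache. The greedy policy
# and all names are illustrative assumptions.

def dispatch_blocks(block_ranges, num_sms):
    """block_ranges: {block_id: (first_line, last_line)} cache-line ranges.
    Returns {sm_id: [block_ids]} with overlapping-range blocks co-located."""
    assignment = {sm: [] for sm in range(num_sms)}
    sm_ranges = {sm: None for sm in range(num_sms)}  # merged range seen per SM

    def overlaps(a, b):
        return a is not None and not (a[1] < b[0] or b[1] < a[0])

    for bid, rng in sorted(block_ranges.items(), key=lambda kv: kv[1]):
        # Prefer an SM that already holds an overlapping range (locality)...
        target = next((sm for sm in range(num_sms)
                       if overlaps(sm_ranges[sm], rng)), None)
        if target is None:
            # ...otherwise fall back to the least-loaded SM (load balance).
            target = min(assignment, key=lambda sm: len(assignment[sm]))
        assignment[target].append(bid)
        lo, hi = rng
        if sm_ranges[target] is not None:
            lo = min(lo, sm_ranges[target][0])
            hi = max(hi, sm_ranges[target][1])
        sm_ranges[target] = (lo, hi)
    return assignment
```

For example, blocks with ranges (0, 7) and (4, 11) end up on one SM while blocks touching lines around 100 land on another, instead of being interleaved round-robin across SMs as a conventional dispatcher would do.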
dc.description.provenance: Made available in DSpace on 2021-07-09T15:52:34Z (GMT). No. of bitstreams: 1. ntu-105-R03922026-1.pdf: 1529116 bytes, checksum: 48df9980b0bc7c2c9de1593fed153400 (MD5). Previous issue date: 2016 [en]
dc.description.tableofcontents:
Acknowledgements i
Abstract (Chinese) ii
Abstract iii
1 Introduction 1
2 Background & Motivation 3
2.1 Background 3
2.1.1 GPGPU Architecture 3
2.1.2 Thread Block Scheduling 4
2.1.3 Warp Scheduling 5
2.2 Motivation 6
3 Locality-Aware Scheduler 9
3.1 Overview of locality-aware scheduler 9
3.2 Compiler Support 10
3.3 Locality-Aware Thread Block Scheduler 12
3.3.1 Thread-Block-Level Access Range Calculation 12
3.3.2 Thread-Block-Dispatching Decision 14
3.4 Locality-Aware Warp Scheduler 15
3.4.1 Warp-Level Access Range Calculation 15
3.4.2 Two-level Warp Scheduler 16
4 Experimental Methodology 20
5 Evaluation 22
5.1 Effect of thread block scheduler 23
5.2 Effect of warp scheduler 24
5.3 Pipeline stall reduction 25
5.4 Hardware Overhead 27
6 Related Works 28
7 Conclusion 30
Bibliography 31
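The two-level warp scheduler listed above (Section 3.4.2) builds on the mechanism introduced by Narasiman et al. (ref. [8]): warps are split into a small "active" set issued round-robin and a larger "pending" set; a warp that stalls on a long-latency memory access is demoted and a ready pending warp is promoted in its place, so stalls from different warp groups overlap. A simplified software model of that base mechanism, with illustrative names not taken from the thesis:

```python
# Simplified model of two-level warp scheduling (after ref. [8]); the
# class and method names are illustrative assumptions.
from collections import deque

class TwoLevelWarpScheduler:
    def __init__(self, warp_ids, active_size):
        self.active = deque(warp_ids[:active_size])   # group currently issuing
        self.pending = deque(warp_ids[active_size:])  # warps waiting for a slot

    def issue(self):
        """Pick the next warp from the active set in round-robin order."""
        if not self.active:
            return None
        warp = self.active.popleft()
        self.active.append(warp)
        return warp

    def on_long_latency_stall(self, warp):
        """Demote a stalled warp; promote a pending warp to keep slots busy."""
        if warp not in self.active:
            return
        self.active.remove(warp)
        self.pending.append(warp)
        if self.pending[0] is not warp:  # promote a different, ready warp
            self.active.append(self.pending.popleft())
```

Keeping the active set small prevents all warps from reaching their memory instructions at once, which is also why the thesis can layer locality awareness on top: the promotion choice is exactly where a locality-aware policy can prefer warps that reuse resident cache lines.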
dc.language.iso: en
dc.subject: General-purpose GPU (GPGPU) [zh_TW]
dc.subject: L1 cache [zh_TW]
dc.subject: Warp scheduler [zh_TW]
dc.subject: Thread block scheduler [zh_TW]
dc.subject: High-performance computing [zh_TW]
dc.subject: Warp (Wavefront) Scheduler [en]
dc.subject: Block (CTA) Scheduler [en]
dc.subject: GPGPU [en]
dc.subject: High Performance Computing [en]
dc.subject: L1 Data Cache [en]
dc.title: 透過位址預先計算以提升圖形處理器記憶體效能 (Improving GPU Memory Performance via Address Pre-computation) [zh_TW]
dc.title: Improving GPU Memory Performance via Address Pre-computation [en]
dc.type: Thesis
dc.date.schoolyear: 104-2
dc.description.degree: Master's
dc.contributor.oralexamcommittee: 陳依蓉, 呂仁碩, 陳坤志
dc.subject.keyword: GPGPU, thread block scheduler, warp scheduler, L1 cache, high-performance computing [zh_TW]
dc.subject.keyword: GPGPU, Block (CTA) Scheduler, Warp (Wavefront) Scheduler, L1 Data Cache, High Performance Computing [en]
dc.relation.page: 33
dc.identifier.doi: 10.6342/NTU201602569
dc.rights.note: Authorized (open access worldwide)
dc.date.accepted: 2016-08-16
dc.contributor.author-college: College of Electrical Engineering and Computer Science (電機資訊學院) [zh_TW]
dc.contributor.author-dept: Graduate Institute of Computer Science and Information Engineering (資訊工程學研究所) [zh_TW]
dc.date.embargo-lift: 2021-11-02
Appears in collections: Department of Computer Science and Information Engineering (資訊工程學系)

Files in this item:
ntu-105-R03922026-1.pdf — 1.49 MB — Adobe PDF

