NTU Theses and Dissertations Repository › College of Electrical Engineering and Computer Science › Department of Computer Science and Information Engineering
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/76451

Full metadata record (DC field: value [language])
dc.contributor.advisor: 楊佳玲
dc.contributor.author: Li-Jhan Chen [en]
dc.contributor.author: 陳立展 [zh_TW]
dc.date.accessioned: 2021-07-09T15:52:34Z
dc.date.available: 2021-11-02
dc.date.copyright: 2016-11-02
dc.date.issued: 2016
dc.date.submitted: 2016-08-15
dc.identifier.citation: [1] K. M. Abdalla et al. Scheduling and execution of compute tasks. US Patent US20130185725, 2013.
[2] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. In Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on, pages 163–174, April 2009.
[3] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. H. Lee, and K. Skadron. Rodinia: A benchmark suite for heterogeneous computing. In Workload Characterization, 2009. IISWC 2009. IEEE International Symposium on, pages 44–54, Oct 2009.
[4] M. Gebhart, D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally, E. Lindholm, and K. Skadron. Energy-efficient mechanisms for managing thread context in throughput processors. In Proceedings of the 38th Annual International Symposium on Computer Architecture, ISCA ’11, pages 235–246, New York, NY, USA, 2011. ACM.
[5] W. Jia, K. A. Shaw, and M. Martonosi. Characterizing and improving the use of demand-fetched caches in GPUs. In Proceedings of the 26th ACM International Conference on Supercomputing, ICS ’12, pages 15–24, New York, NY, USA, 2012. ACM.
[6] M. Lee, S. Song, J. Moon, J. Kim, W. Seo, Y. Cho, and S. Ryu. Improving GPGPU resource utilization through alternative thread block scheduling. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), pages 260–271, Feb 2014.
[7] D. Li, M. Rhu, D. R. Johnson, M. O’Connor, M. Erez, D. Burger, D. S. Fussell, and S. W. Keckler. Priority-based cache allocation in throughput processors. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pages 89–100, Feb 2015.
[8] V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt. Improving GPU performance via large warps and two-level warp scheduling. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44, pages 308–317, New York, NY, USA, 2011. ACM.
[9] J. Nickolls and W. J. Dally. The GPU computing era. IEEE Micro, 30(2):56–69, March 2010.
[10] NVIDIA. NVIDIA’s Next Generation CUDA Compute Architecture: Fermi. http://www.nvidia.com/content/pdf/fermi_white_papers/nvidia_fermi_compute_architecture_whitepaper.pdf, 2009.
[11] NVIDIA. CUDA C/C++ SDK code samples, 2011.
[12] NVIDIA. Kepler GK110 whitepaper. http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf, 2012.
[13] NVIDIA Corporation. NVIDIA CUDA Compute Unified Device Architecture Programming Guide. NVIDIA Corporation, 2007.
[14] M. S. Orr, B. M. Beckmann, S. K. Reinhardt, and D. A. Wood. Fine-grain task aggregation and coordination on GPUs. In Proceedings of the 41st Annual International Symposium on Computer Architecture, ISCA ’14, pages 181–192, Piscataway, NJ, USA, 2014. IEEE Press.
[15] T. G. Rogers, M. O’Connor, and T. M. Aamodt. Cache-conscious wavefront scheduling. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-45, pages 72–83, Washington, DC, USA, 2012. IEEE Computer Society.
[16] T. G. Rogers, M. O’Connor, and T. M. Aamodt. Divergence-aware warp scheduling. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-46, pages 99–110, New York, NY, USA, 2013. ACM.
[17] J. E. Stone, D. Gohara, and G. Shi. OpenCL: A parallel programming standard for heterogeneous computing systems. Computing in Science & Engineering, 12(3):66–73, May 2010.
[18] X. Xie, Y. Liang, Y. Wang, G. Sun, and T. Wang. Coordinated static and dynamic cache bypassing for GPUs. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pages 76–88, Feb 2015.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/76451
dc.description.abstract: (Translated from Chinese.) The high performance of general-purpose GPUs comes mainly from massive thread-level parallelism and full resource utilization. Under the GPGPU execution model, thousands of threads are used to achieve a high degree of parallelism, yet each core has only a small L1 cache to reduce memory access latency. With so many threads competing for a small L1 cache, cache performance suffers and overall system performance is limited.
In this thesis, we examine the thread block scheduler and warp scheduler designs of current GPUs. We find that existing block schedulers tend to spread thread blocks that touch the same cache lines across different cores, reducing opportunities for cache-line reuse. We therefore design locality-aware block and warp schedulers to improve cache performance. Through software/hardware cooperation, the block scheduler learns in advance which cache lines each block will touch and places blocks that share cache lines on the same core, increasing reuse opportunities. In addition, the locality-aware warp scheduler controls warp execution order at fine granularity to capture those reuse opportunities. Experimental results show that the proposed locality-aware scheduler effectively improves cache performance and delivers roughly 10% higher overall performance than state-of-the-art scheduler designs. [zh_TW]
dc.description.abstract: High-performance computing on GPGPUs relies on maximizing thread-level parallelism and fully utilizing resources. In the GPU execution model, thousands of threads are employed to achieve a high degree of parallelism. However, only a small L1 cache is provided in each SM (streaming multiprocessor) to reduce memory access latency. Massive numbers of threads competing for a small L1 cache cause poor cache performance and limit system performance.
In this thesis, we explore the current design of thread block and warp schedulers. We find that the current thread block scheduler tends to allocate thread blocks that use the same cache lines to different SMs, reducing cache reuse opportunities. We therefore design a Locality-Aware Scheduler to improve GPU cache performance. With our proposed software and hardware cooperative method, the cache lines touched by a block can be known a priori, so the thread block scheduler can place blocks that share cache lines on the same SM to increase cache-line reuse. In addition, the locality-aware warp scheduler controls warp execution order at fine granularity to capture cache locality. The results show that our Locality-Aware Scheduler effectively improves cache performance and achieves about 10% higher performance on average than state-of-the-art scheduling policies. [en]
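The block-dispatch idea in the abstract can be sketched in a few lines. This is a minimal, purely illustrative model, not the thesis's actual algorithm: it assumes each thread block's cache-line access range has already been precomputed (the software/hardware cooperation described above), and it uses a simple greedy policy (hypothetical names throughout) to co-locate blocks with overlapping ranges on the same SM.

```python
# Hypothetical sketch of locality-aware block dispatching: blocks whose
# precomputed cache-line ranges overlap are sent to the same SM so the
# shared lines can be reused in that SM's L1 cache. The greedy policy
# and all names are illustrative assumptions.

def dispatch_blocks(block_ranges, num_sms):
    """block_ranges: {block_id: (first_line, last_line)} cache-line ranges.
    Returns {sm_id: [block_ids]} with overlapping-range blocks co-located."""
    assignment = {sm: [] for sm in range(num_sms)}
    sm_ranges = {sm: None for sm in range(num_sms)}  # merged range seen per SM

    def overlaps(a, b):
        return a is not None and not (a[1] < b[0] or b[1] < a[0])

    for bid, rng in sorted(block_ranges.items(), key=lambda kv: kv[1]):
        # Prefer an SM that already holds an overlapping range (locality)...
        target = next((sm for sm in range(num_sms)
                       if overlaps(sm_ranges[sm], rng)), None)
        if target is None:
            # ...otherwise fall back to the least-loaded SM (load balance).
            target = min(assignment, key=lambda sm: len(assignment[sm]))
        assignment[target].append(bid)
        lo, hi = rng
        if sm_ranges[target] is not None:
            lo = min(lo, sm_ranges[target][0])
            hi = max(hi, sm_ranges[target][1])
        sm_ranges[target] = (lo, hi)
    return assignment
```

For example, blocks with ranges (0, 7) and (4, 11) end up on one SM while blocks touching lines around 100 land on another, instead of being interleaved round-robin across SMs as a conventional dispatcher would do.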
dc.description.provenance: Made available in DSpace on 2021-07-09T15:52:34Z (GMT). No. of bitstreams: 1. ntu-105-R03922026-1.pdf: 1529116 bytes, checksum: 48df9980b0bc7c2c9de1593fed153400 (MD5). Previous issue date: 2016 [en]
dc.description.tableofcontents:
Acknowledgements i
Abstract (Chinese) ii
Abstract iii
1 Introduction 1
2 Background & Motivation 3
2.1 Background 3
2.1.1 GPGPU Architecture 3
2.1.2 Thread Block Scheduling 4
2.1.3 Warp Scheduling 5
2.2 Motivation 6
3 Locality-Aware Scheduler 9
3.1 Overview of locality-aware scheduler 9
3.2 Compiler Support 10
3.3 Locality-Aware Thread Block Scheduler 12
3.3.1 Thread-Block-Level Access Range Calculation 12
3.3.2 Thread-Block-Dispatching Decision 14
3.4 Locality-Aware Warp Scheduler 15
3.4.1 Warp-Level Access Range Calculation 15
3.4.2 Two-level Warp Scheduler 16
4 Experimental Methodology 20
5 Evaluation 22
5.1 Effect of thread block scheduler 23
5.2 Effect of warp scheduler 24
5.3 Pipeline stall reduction 25
5.4 Hardware Overhead 27
6 Related Works 28
7 Conclusion 30
Bibliography 31
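The two-level warp scheduler listed above (Section 3.4.2) builds on the mechanism introduced by Narasiman et al. (ref. [8]): warps are split into a small "active" set issued round-robin and a larger "pending" set; a warp that stalls on a long-latency memory access is demoted and a ready pending warp is promoted in its place, so stalls from different warp groups overlap. A simplified software model of that base mechanism, with illustrative names not taken from the thesis:

```python
# Simplified model of two-level warp scheduling (after ref. [8]); the
# class and method names are illustrative assumptions.
from collections import deque

class TwoLevelWarpScheduler:
    def __init__(self, warp_ids, active_size):
        self.active = deque(warp_ids[:active_size])   # group currently issuing
        self.pending = deque(warp_ids[active_size:])  # warps waiting for a slot

    def issue(self):
        """Pick the next warp from the active set in round-robin order."""
        if not self.active:
            return None
        warp = self.active.popleft()
        self.active.append(warp)
        return warp

    def on_long_latency_stall(self, warp):
        """Demote a stalled warp; promote a pending warp to keep slots busy."""
        if warp not in self.active:
            return
        self.active.remove(warp)
        self.pending.append(warp)
        if self.pending[0] is not warp:  # promote a different, ready warp
            self.active.append(self.pending.popleft())
```

Keeping the active set small prevents all warps from reaching their memory instructions at once, which is also why the thesis can layer locality awareness on top: the promotion choice is exactly where a locality-aware policy can prefer warps that reuse resident cache lines.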
dc.language.iso: en
dc.subject: General-purpose GPU (GPGPU) [zh_TW]
dc.subject: L1 cache [zh_TW]
dc.subject: Warp scheduler [zh_TW]
dc.subject: Thread block scheduler [zh_TW]
dc.subject: High-performance computing [zh_TW]
dc.subject: Warp (Wavefront) Scheduler [en]
dc.subject: Block (CTA) Scheduler [en]
dc.subject: GPGPU [en]
dc.subject: High Performance Computing [en]
dc.subject: L1 Data Cache [en]
dc.title: 透過位址預先計算以提升圖形處理器記憶體效能 (Improving GPU Memory Performance via Address Pre-computation) [zh_TW]
dc.title: Improving GPU Memory Performance via Address Pre-computation [en]
dc.type: Thesis
dc.date.schoolyear: 104-2
dc.description.degree: Master's
dc.contributor.oralexamcommittee: 陳依蓉, 呂仁碩, 陳坤志
dc.subject.keyword: GPGPU, thread block scheduler, warp scheduler, L1 cache, high-performance computing [zh_TW]
dc.subject.keyword: GPGPU, Block (CTA) Scheduler, Warp (Wavefront) Scheduler, L1 Data Cache, High Performance Computing [en]
dc.relation.page: 33
dc.identifier.doi: 10.6342/NTU201602569
dc.rights.note: Authorized (open access worldwide)
dc.date.accepted: 2016-08-16
dc.contributor.author-college: College of Electrical Engineering and Computer Science (電機資訊學院) [zh_TW]
dc.contributor.author-dept: Graduate Institute of Computer Science and Information Engineering (資訊工程學研究所) [zh_TW]
dc.date.embargo-lift: 2021-11-02
Appears in collections: Department of Computer Science and Information Engineering (資訊工程學系)

Files in this item:
ntu-105-R03922026-1.pdf — 1.49 MB — Adobe PDF

