SIMT Execution Analysis for Graphics and GPGPU Applications on Modern GPUs

Chen-Ming Chung; 鐘振銘

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/63815

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	楊佳玲(Chia-Lin Yang)
dc.contributor.author	Chen-Ming Chung	en
dc.contributor.author	鐘振銘	zh_TW
dc.date.accessioned	2021-06-16T17:19:52Z	-
dc.date.available	2014-08-19
dc.date.copyright	2012-08-19
dc.date.issued	2012
dc.date.submitted	2012-08-17
dc.identifier.citation	[1] Imagination Technologies PowerVR Insider SDK. [2] Unreal Technology. [3] T. Austin, E. Larson, and D. Ernst. SimpleScalar: An Infrastructure for Computer System Modeling. Computer, 35(2):59-67, Feb. 2002. [4] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt. Analyzing CUDA Workloads Using a Detailed GPU Simulator. In Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on, pages 163-174, Apr. 2009. [5] M. Chambers. NVIDIA GeForce3 Preview. [6] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A Benchmark Suite for Heterogeneous Computing. In IISWC, pages 44-54. IEEE, 2009. [7] B. W. Coon and J. E. Lindholm. System and Method for Processing Thread Groups in A SIMD Architecture, June 2007. [8] B. W. Coon and J. E. Lindholm. System and Method for Managing Divergent Threads in A SIMD Architecture, April 2008. [9] S. P. E. Corporation. Specviewperf 11. [10] V. del Barrio, C. Gonzalez, J. Roca, A. Fernandez, and E. E. ATTILA: A Cycle-Level Execution-Driven Simulator for Modern GPU Architectures. In Performance Analysis of Systems and Software, 2006 IEEE International Symposium on, pages 231-241, Mar. 2006. [11] J. S. Donham, Christopher D. S. and Montrym and P. R. Marchand. US Patent 7565490: Out of Order Graphics L2 Cache, July 2009. [12] S. Drone. Under the Hood: Revving up Shader Performance. Gamefest Unplugged (Europe), 2007. [13] W. Fung, I. Sham, G. Yuan, and T. Aamodt. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. In Microarchitecture, 2007. MICRO 2007. 40th Annual IEEE/ACM International Symposium on, pages 407-420, Dec. 2007. [14] M. Gebhart, D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally, E. Lindholm, and K. Skadron. Energy-Efficient Mechanisms for Managing Thread Context in Throughput Processors. In ISCA, pages 235-246, 2011. [15] G. Humphreys, M. Houston, R. Ng, R. Frank, S. Ahern, P. D. Kirchner, and J. T. Klosowski. Chromium: A Stream-Processing Framework for Interactive Rendering on Clusters. In Proceedings of the 29th annual conference on Computer graphics and interactive techniques, SIGGRAPH '02, pages 693-702, New York, NY, USA, 2002. ACM. [16] S. V. I. Antochi, B. Juurlink and P. Liuha. Graalbench: A 3D Graphics Benchmark Suite for Mobile Phones. In Conference on Languages, Compilers, and Tools for Embedded Systems, 2004. [17] K. F. I. Buck and P. Hanrahan. Gpubench: Evaluating gpu performance for numerical and scientific applications. In ACM Workshop on General Purpose Computing on Graphics Processors, 2004. [18] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. NVIDIA Tesla: A Unified Graphics and Computing Architecture. Micro, IEEE, 28(2):39-55, Mar.-Apr. 2008. [19] C. W. Lo. A Cycle-Accurate Simulator for Modern GPU. Master's thesis, National Taiwan University, Taiwan, 2010. [20] J. Meng, D. Tarjan, and K. Skadron. Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance. In Proceedings of the 37th ACM/IEEE International Symposium on Computer Architecture. ACM/IEEE, Jun. 2010. [21] A. L. Minkin and O. Rubinstein. US Patent 6629188: Circuit and Method for Prefetching Data for A Texture Cache, September 2003. [22] S. S. Moy and J. E. Lindholm. Across-Thread Out Of Order Instruction Dispatch in A Multithreaded Graphics Processor, December 2007. [23] V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt. Improving GPU Performance via Large Warps and Two-level Warp Scheduling. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44 '11, pages 308-317, New York, NY, USA, 2011. ACM. [24] NVIDIA. NVIDIA's New Generation CUDA Compute Architecture: Fermi. [25] J. W. Sheaffer, D. Luebke, and K. Skadron. A Flexible Simulation Framework for Graphics Architectures. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, HWWS '04, pages 85-94, New York, NY, USA, 2004. ACM. [26] K. Skadron, M. R. Stan, K. Sankaranarayanan, W. Huang, S. Velusamy, and D. Tarjan. Temperature-Aware Microarchitecture: Modeling and Implementation. ACM Trans. Archit. Code Optim., 1(1):94-125, Mar. 2004. [27] N. Tatarchuk. Dynamic Parallax Occlusion Mapping with Approximate Soft Shadows. In I3D '06: Proceedings of the 2006 symposium on Interactive 3D graphics and games, pages 63-69, New York, NY, USA, 2006. ACM. [28] Y. Uralsky and A. Ahmad. Soft Shadows. NVIDIA SDK White Paper, 2004.
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/63815	-
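dc.description.abstract	Modern graphics processing units (GPUs) offer far more computing power than central processing units (CPUs) and have therefore attracted growing research attention. Graphics applications are no longer the only applications that run on GPUs: many highly parallel general-purpose applications that originally ran on CPUs are also well suited to GPUs. Such work is called a general-purpose GPU (GPGPU) workload. Because graphics and general-purpose workloads emphasize different things, they do not necessarily benefit from the same architecture. However, very few GPU architecture studies simulate both classes of applications. In our previous work, we built a cycle-level simulator of a single-instruction, multiple-thread (SIMT) processor that supports a graphics programming language; a SIMT GPU groups many threads into warps for execution. In this master's thesis, we extend that work to support general-purpose applications. With both classes of applications running on the same GPU architecture, we analyze several characteristics, including the dynamic instruction mix, the branch divergence ratio, the effect of SIMT width, the effect of the number of concurrently executing warps, and the effect of warp scheduling.	zh_TW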
dc.description.abstract	Modern Graphics Processing Units (GPUs) have attracted a lot of attention recently because they provide orders of magnitude more computing power than CPUs. Graphics applications are not the only workloads for GPUs. General-purpose applications with high data parallelism, called GPGPU applications, are also suitable for GPUs. Because GPGPU and graphics applications have different purposes, they may not benefit from the same architectural design. Therefore, a simulation framework that supports both kinds of applications is essential for GPU architecture research. In our previous work, we presented a cycle-level simulation framework for modern GPUs that models the SIMT (Single-Instruction, Multiple-Thread) execution pipeline and supports graphics workloads (OpenGL ES). In this thesis, we extend that work to support GPGPU applications as well. With the simulation framework, we conduct workload characterization for both graphics and GPGPU workloads, including dynamic instruction mixes, branch divergence ratio, and the effects of concurrent warps, SIMT width, and warp scheduling policies.	en
dc.description.provenance	Made available in DSpace on 2021-06-16T17:19:52Z (GMT). No. of bitstreams: 1 ntu-101-R99922128-1.pdf: 4306742 bytes, checksum: e05086edacbbfb5409ce9c3f53bcb97c (MD5) Previous issue date: 2012	en
dc.description.tableofcontents	1 Introduction; 2 Related Works; 2.1 GPU Simulation Frameworks; 2.2 Warp Scheduling; 2.3 Benchmark; 3 GPU Overview; 3.1 Single-Instruction Multiple-Thread (SIMT) Execution Model; 3.2 Microarchitecture of Stream Multiprocessor; 3.2.1 Supports to Control-flow Divergence; 4 CUDA Simulation Framework; 4.1 OpenGL ES Simulation Framework; 4.2 CUDA Simulation Framework; 4.2.1 CUDA Interceptor; 4.2.2 PTX Assembler; 4.2.3 CUDA Sim-driver; 4.3 SIMT-GPU Simulator; 4.4 Benchmarks; 5 Experimental Setup; 5.1 Hardware Configuration; 5.2 Benchmark; 6 Experimental Results; 6.1 Dynamic Instruction Distribution; 6.2 Effects of Concurrent Warps; 6.3 Effects of SIMT Width; 6.4 Effect of Warp Scheduling Policies; 6.4.1 Latency Hiding Capabilities; 6.4.2 Memory Access Patterns; 6.4.3 Register File Usage; 7 Conclusion; Bibliography
dc.language.iso	en
dc.title	Analysis of Single-Instruction Multiple-Thread Execution for Graphics and General-Purpose GPU Programs on Modern Graphics Processors	zh_TW
dc.title	SIMT Execution Analysis for Graphics and GPGPU Applications on Modern GPUs	en
dc.type	Thesis
dc.date.schoolyear	100-2
dc.description.degree	Master's
dc.contributor.oralexamcommittee	陳維超,洪士灝
dc.subject.keyword	GPU architecture,warp scheduling,general-purpose GPU workloads,graphics workloads,SIMT execution model,SIMT width,number of concurrently executing warps	zh_TW
dc.subject.keyword	GPU Architecture,Warp Scheduling,GPGPU Benchmarks,Graphics Workloads,SIMT execution model,SIMT width,Concurrent Warps	en
dc.relation.page	45
dc.rights.note	Paid license
dc.date.accepted	2012-08-17
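dc.contributor.author-college	College of Electrical Engineering and Computer Science	zh_TW
dc.contributor.author-dept	Graduate Institute of Computer Science and Information Engineering	zh_TW
Appears in Collections:	Department of Computer Science and Information Engineering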

Files in This Item:

File	Size	Format
ntu-101-1.pdf (not currently authorized for public access)	4.21 MB	Adobe PDF


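Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.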