Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/63815
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 楊佳玲(Chia-Lin Yang) | |
dc.contributor.author | Chen-Ming Chung | en |
dc.contributor.author | 鐘振銘 | zh_TW |
dc.date.accessioned | 2021-06-16T17:19:52Z | - |
dc.date.available | 2014-08-19 | |
dc.date.copyright | 2012-08-19 | |
dc.date.issued | 2012 | |
dc.date.submitted | 2012-08-17 | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/63815 | - |
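dc.description.abstract | Modern graphics processing units (GPUs) offer far more computing power than central processing units (CPUs), which has drawn growing attention to them in current research. Graphics applications are no longer the only applications on GPUs: many highly parallel general-purpose applications that originally ran on CPUs are well suited to running on GPUs, and such work is called general-purpose GPU (GPGPU) workloads. Because graphics and general-purpose workloads emphasize different things, they do not necessarily both benefit from the same architecture. However, very few GPU architecture studies simulate both classes of applications. In our previous work, we built a cycle-level simulator of a single-instruction multiple-thread (SIMT) processor that supports a graphics programming language; a SIMT GPU groups many threads into warps for execution. In this master's thesis, we extend that work to support general-purpose applications. With both classes of applications running on the same GPU architecture, we analyze several characteristics, including dynamic instruction mix, branch divergence ratio, the effect of SIMT width, the effect of the number of concurrent warps, and the effect of warp scheduling. | zh_TW |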
dc.description.abstract | Modern Graphics Processing Units (GPUs) have attracted a lot of attention recently because they provide orders of magnitude more computing power than CPUs. Graphics applications are not the only workloads for GPUs. General-purpose applications with high data parallelism, called GPGPU applications, are also well suited to GPUs. Because GPGPU and graphics applications have different goals, they may not benefit from the same architectural design. Therefore, a simulation framework supporting both kinds of applications is essential for GPU architecture research. In our previous work, we presented a cycle-level simulation framework for modern GPUs that models the SIMT (Single-Instruction, Multiple-Thread) execution pipeline and supports graphics workloads (OpenGL ES). In this thesis, we extend that work to support GPGPU applications as well. With the simulation framework, we conduct workload characterization for both graphics and GPGPU workloads, including dynamic instruction mixes, branch divergence ratio, and the effects of concurrent warps, SIMT width, and warp scheduling policies. | en |
dc.description.provenance | Made available in DSpace on 2021-06-16T17:19:52Z (GMT). No. of bitstreams: 1 ntu-101-R99922128-1.pdf: 4306742 bytes, checksum: e05086edacbbfb5409ce9c3f53bcb97c (MD5) Previous issue date: 2012 | en |
dc.description.tableofcontents | 1 Introduction; 2 Related Works; 2.1 GPU Simulation Frameworks; 2.2 Warp Scheduling; 2.3 Benchmark; 3 GPU Overview; 3.1 Single-Instruction Multiple-Thread (SIMT) Execution Model; 3.2 Microarchitecture of Stream Multiprocessor; 3.2.1 Supports to Control-flow Divergence; 4 CUDA Simulation Framework; 4.1 OpenGL ES Simulation Framework; 4.2 CUDA Simulation Framework; 4.2.1 CUDA Interceptor; 4.2.2 PTX Assembler; 4.2.3 CUDA Sim-driver; 4.3 SIMT-GPU Simulator; 4.4 Benchmarks; 5 Experimental Setup; 5.1 Hardware Configuration; 5.2 Benchmark; 6 Experimental Results; 6.1 Dynamic Instruction Distribution; 6.2 Effects of Concurrent Warps; 6.3 Effects of SIMT Width; 6.4 Effect of Warp Scheduling Policies; 6.4.1 Latency Hiding Capabilities; 6.4.2 Memory Access Patterns; 6.4.3 Register File Usage; 7 Conclusion; Bibliography | |
dc.language.iso | en | |
dc.title | 分析單一指令多重執行緒執行在現代圖形處理器上的圖形及一般目的圖形處理器程序 | zh_TW |
dc.title | SIMT Execution Analysis for Graphics and GPGPU applications on Modern GPUs | en |
dc.type | Thesis | |
dc.date.schoolyear | 100-2 | |
dc.description.degree | Master's | |
dc.contributor.oralexamcommittee | 陳維超,洪士灝 | |
dc.subject.keyword | GPU architecture, warp scheduling, general-purpose GPU workloads, graphics workloads, SIMT execution model, SIMT width, number of concurrent warps | zh_TW |
dc.subject.keyword | GPU Architecture, Warp Scheduling, GPGPU Benchmarks, Graphics Workloads, SIMT execution model, SIMT width, Concurrent Warps | en |
dc.relation.page | 45 | |
dc.rights.note | Paid licensing | |
dc.date.accepted | 2012-08-17 | |
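dc.contributor.author-college | College of Electrical Engineering and Computer Science | zh_TW |
dc.contributor.author-dept | Graduate Institute of Computer Science and Information Engineering | zh_TW |
Appears in Collections: | Department of Computer Science and Information Engineering |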
Files in this item:
File | Size | Format | |
---|---|---|---|
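ntu-101-1.pdf (not currently authorized for public access) | 4.21 MB | Adobe PDF |
Unless their copyright terms are specifically stated, all items in the system are protected by copyright, with all rights reserved.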