Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/20434
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 楊佳玲(Chia-Lin Yang) | |
dc.contributor.author | Po-Han Wang | en |
dc.contributor.author | 王柏翰 | zh_TW |
dc.date.accessioned | 2021-06-08T02:48:35Z | - |
dc.date.copyright | 2017-08-28 | |
dc.date.issued | 2017 | |
dc.date.submitted | 2017-08-18 | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/20434 | - |
dc.description.abstract | 圖形處理器(GPU) 由於其高生產量(throughput)、高平行度(parallelism) 以及高能耗效率(energy efficiency) 之特性,為近十年來最受矚目之處理單元。許多研究均發現具高度資料平行(data-parallelism) 特性之應用程式可於GPU 上取得數十倍甚至數百倍於中央處理器(CPU) 之效能提升。然而近年來有越來越多於GPU 上執行不規則應用程式(irregular applications) 之需求,例如大數據或機器學習等。這些不規則應用程式多半包含資料相依(data-dependent) 之控制流(control-flow) 或記憶體存取,並包含動態生成之執行緒層級平行度(dynamic parallelism),使其無法有效地透過目前GPU 所支援之程式設計模型(programming model) 達到較高之執行效率。近年來新的GPU 架構透過允許各執行緒(thread) 於執行過程中產生新子核心(child kernels),稱為device-side kernel launch,來平行化前述之動態執行緒平行度。然而,許多研究發現目前GPU 對device-side kernel launch 的軟硬體實作,因為子核心的啟動延遲(invocation overhead) 及低執行效率而無法有效利用GPU 的高生產量。很多時候甚至會造成效能的下降。
本論文試著從根源處,也就是device-side kernel launch 的程式設計模型,來了解其限制。本論文發現程式設計者必須在撰寫過程中自行設定device-side kernel launch 的重要參數,但動態平行應用程式的特性使得使用者很難選到能讓程式執行有效率的參數。此外,也發現到有很多有機會平行執行的動態工作(dynamic tasks) 必須要循序執行(execute in serial)。與其透過調整參數或其他軟硬體方法來減少啟動延遲,本論文的目標在於執行動態平行應用程式時將GPU 的效能完全釋放。換句話說,將「所有動態產生的工作」平行地執行於子核心中,在執行過程裡保持高效率,且避免程式設計者花費過多的力氣在參數選擇上。透過分析,我整理出了下列三個完全利用子核心平行度潛能的挑戰:〈一〉雖然理想的執行參數(execution parameters) 可以透過離線分析(off-line profiling) 取得,但要直接使用到執行緒數量不一的子核心上,往往不能收到預期的效果;〈二〉同時產生的子核心數量過多,執行緒區塊(thread block) 調度(dispatching) 過程中需要從DRAM讀取其中介資料(meta data) 而拖慢整體執行速度;以及〈三〉子核心執行緒數量普遍較少,造成其執行過程裡核心參數(kernel parameters) 的讀取較無效率。 對此,本論文提出了一個整體性的解決方案,稱之為動態核心編排(Dynamic Kernel Consolidation and Scheduling, DKCS),希望能在 GPU 執行包含device-side kernel launch 之應用程式時完全發揮其高平行度及高生產量。此機制的核心為一個包含編譯器及硬體支援的動態核心融合(Dynamic Kernel Consolidation, DKC) 機制,將大量有著不同數量執行緒的子核心打散重組,並選擇更合適的執行參數以提高其執行效能。針對同時產生子核心數量過多的問題,DKCS 提出考量動態平行之GPU 排程機制(Dynamic Parallelism-aware GPU Scheduling, DPS),利用於GPU SM 裡同步多工處理的技術(simultaneous multikernel, SMK) 同時執行親/子核心,並在需要的時候優先執行子核心來避免執行緒區塊調度時額外的DRAM 存取延遲。對於子核心內執行緒數量過少的問題,DKCS 提出使用另外一條data-path, read-only cache, 而非原本的 constant cache 來存取核心參數以提高存取效能。同時,DKCS 可以讓程式的撰寫過程變得非常簡單,程式設計者不再需要繁瑣地調整子核心啟動的條件或執行參數來確保整體執行效率。實驗結果可發現,本論文所提出之動態核心編排機制相比於過去降低單一核心啟動延遲之 dynamic thread-block launch (DTBL) 機制平均可提高64% 之執行效能,對於動態調整親子核心工作分配比例的SPAWN 機制亦可提高44% 之平均效能。 | zh_TW
dc.description.abstract | The demand for accelerating irregular applications on GPUs has increased recently. Such applications often operate on unstructured data sets such as trees, graphs, or sparse matrices, and therefore exhibit dynamic, workload-dependent nested thread-level parallelism with a hard-to-predict degree of parallelism. Modern GPUs support device-side kernel launch (or CUDA dynamic parallelism, CDP) to improve the programmability of irregular GPGPU applications with dynamic parallelism. However, research has shown that the current GPU implementation of device-side kernel launch fails to make good use of the GPU's high throughput because of the non-trivial kernel invocation overhead and the inefficient execution of device-launched kernels.
This dissertation aims at fully exploiting the potential of device-side kernel launch. I analyze the limitations of device-side kernel launch under the CDP programming model and find that device-launched kernels are generally executed with low utilization, for two reasons: the CDP programming model requires a programmer to determine important parameters for device-side kernel launch, and the very nature of dynamic parallelism makes selecting parameters that maximize execution efficiency extremely difficult. Furthermore, a significant portion of parallelizable tasks is still forced to execute serially, which impairs the potential of device-side kernel launch. To fully unleash the benefit of device kernels, I propose an ideal version of CDP code and summarize the challenges of executing it efficiently as follows. First, child kernels should be executed with tuned execution parameters. Though favorable execution parameters can be statically calculated, applying them to child kernels is non-trivial, as the number of threads in a child kernel is set dynamically at run time and may not fit the selected execution parameters. Second, in the ideal CDP code, the number of device kernels is often orders of magnitude larger than the number of host kernels (from thousands to hundreds of thousands), but with far fewer threads per kernel (55 on average for the tested workloads). Massive numbers of device kernels, each with few threads, lead to inefficient metadata and kernel-parameter accesses during execution. Based on these analyses, I propose a Dynamic Kernel Consolidation and Scheduling (DKCS) framework as the first attempt to fully exploit the potential of device-side kernel launch. The key innovation of DKCS is a run-time layer that performs kernel fusing/splitting and thread-block regrouping operations to form consolidated kernels and thread blocks to which the selected grid and block sizes can be successfully applied.
DKCS also includes two features that resolve the inefficiency incurred by massive numbers of small child kernels: a block and warp scheduling policy for efficient metadata access, and the use of an alternative data path, the read-only cache, for efficient access to child kernel parameters. With DKCS, programming for dynamic parallelism becomes simple, without cumbersome tuning of the threshold values for launching device-side kernels or of the kernel execution parameters. Results show that the proposed DKCS framework outperforms dynamic thread-block launch (DTBL), a mechanism that reduces the invocation overhead of a single child kernel, by 64%, and outperforms SPAWN, a mechanism that throttles device-side kernel launches to keep their costs from overshadowing their benefits, by 44%. | en
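The consolidation idea described in the abstract above — regrouping the threads of many small, dynamically launched child kernels into thread blocks of a tuned size — can be sketched in plain Python. This is only an illustrative model of the DKC concept, not the thesis's compiler/hardware implementation; the function name and data layout are invented for the example.

```python
def consolidate(child_thread_counts, block_size):
    """Pack the thread ranges of many child kernels into consolidated
    thread blocks of a chosen (tuned) block size.

    Returns a list of blocks; each block is a list of
    (child_id, start_thread, end_thread) slices, and the total number
    of threads in a block never exceeds `block_size`. A large child
    kernel is split across blocks; small ones are fused together.
    """
    blocks = []
    current, free = [], block_size
    for cid, n in enumerate(child_thread_counts):
        start = 0
        while n > 0:
            take = min(n, free)          # fill the current block
            current.append((cid, start, start + take))
            start += take
            n -= take
            free -= take
            if free == 0:                # block is full: emit it
                blocks.append(current)
                current, free = [], block_size
    if current:                          # final, possibly partial block
        blocks.append(current)
    return blocks

# Example: five small child kernels (the thesis reports ~55 threads per
# child kernel on average) consolidated into 128-thread blocks.
sizes = [40, 70, 16, 200, 30]
blocks = consolidate(sizes, 128)
print(len(blocks))  # -> 3
```

Three consolidated blocks replace five under-occupied child launches, which is the utilization effect the abstract attributes to kernel fusing/splitting and thread-block regrouping.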
dc.description.provenance | Made available in DSpace on 2021-06-08T02:48:35Z (GMT). No. of bitstreams: 1 ntu-106-F96922002-1.pdf: 7297386 bytes, checksum: 852be7988516715fd0e80187f2770188 (MD5) Previous issue date: 2017 | en |
dc.description.tableofcontents | Committee Approval (口試委員會審定書)
Abstract (Chinese, 摘要)
Abstract
1 Introduction
2 Background
2.1 GPGPU Basics
2.1.1 GPU Hardware Model
2.1.2 GPU Programming Model
2.1.3 GPU Execution Model
2.2 Execution Parameter Tuning for GPGPU Kernels
2.3 Dynamic Parallelism and Device-side Kernel Launch
3 Related Work
3.1 Irregular GPGPU Applications
3.2 Device-side Kernel Launch
3.3 GPU Resource Management and Scheduling
4 Motivation
4.1 Limitation of CDP Programming Model
4.2 How to Fully Exploit the Potential of Device-side Kernel Launch?
5 Dynamic Kernel Consolidation and Scheduling (DKCS) Framework
5.1 Dynamic Kernel Consolidation (DKC)
5.1.1 Dispatching Consolidated Kernels
5.1.2 Execution Parameter Selection
5.1.3 Timing Analysis
5.1.4 Cost Analysis
5.2 Dynamic Parallelism-aware TB and Warp Scheduling (DPS)
5.2.1 DPS Block Scheduler
5.2.2 DPS Warp Scheduler
5.2.3 Cost and Timing Analysis
5.2.4 Joint Use with Existing Schedulers
5.3 Efficient Kernel Parameter Loading from the Read-Only Cache (ROC)
6 Experimental Evaluations
6.1 Methodology
6.2 Effects of DKCS
6.3 DKCS Analysis
6.4 MDB Size Sensitivity
6.5 DKCS on Different GPU Architectures
7 Concluding Remarks
Bibliography | |
dc.language.iso | en | |
dc.title | 動態核心編排:提升包含動態執行緒平行度之圖形處理器應用程式之執行效能 | zh_TW |
dc.title | Dynamic Kernel Consolidation and Scheduling: Improving Performance of Irregular Applications with Dynamic Parallelism on GPUs | en |
dc.type | Thesis | |
dc.date.schoolyear | 105-2 | |
dc.description.degree | 博士 (Ph.D.) | |
dc.contributor.oralexamcommittee | 簡韶逸(Shao-Yi Chien),施吉昇(Chi-Sheng Shih),阮聖彰(Shanq-Jang Ruan),謝仁偉(Jen-Wei Hsieh),林泰吉(Tay-Jyi Lin) | |
dc.subject.keyword | 圖形處理器,圖形處理器端核心啟動,CUDA動態平行,不規則GPGPU應用程式,圖形處理器多工,圖形處理器排程, | zh_TW |
dc.subject.keyword | GPU,Device-side kernel launch,CUDA dynamic parallelism,Irregular GPGPU applications,GPU multitasking,GPU scheduling, | en |
dc.relation.page | 93 | |
dc.identifier.doi | 10.6342/NTU201703497 | |
dc.rights.note | 未授權 (not authorized for public access) | |
dc.date.accepted | 2017-08-18 | |
dc.contributor.author-college | 電機資訊學院 | zh_TW |
dc.contributor.author-dept | 資訊工程學研究所 | zh_TW |
Appears in collections: | 資訊工程學系 (Computer Science and Information Engineering)
Files in this item:
File | Size | Format | |
---|---|---|---|
ntu-106-1.pdf (public access currently not authorized) | 7.13 MB | Adobe PDF |
All items in the system are protected by copyright, with all rights reserved, unless otherwise indicated.