Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/86785

Full metadata record
| DC 欄位 | 值 | 語言 |
|---|---|---|
| dc.contributor.advisor | 郭斯彥(Sy-Yen Kuo) | |
| dc.contributor.author | Yu-Wei Chu | en |
| dc.contributor.author | 朱祐葳 | zh_TW |
| dc.date.accessioned | 2023-03-20T00:17:39Z | - |
| dc.date.copyright | 2022-08-10 | |
| dc.date.issued | 2022 | |
| dc.date.submitted | 2022-07-25 | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/86785 | - |
| dc.description.abstract | 圖形處理器為一種可高度平行運算之計算設備,相較於中央處理器,能更有效率地平行處理大量的資料。然而,圖形處理器與中央處理器間的異質記憶體架構造成資料搬運的成本,降低了圖形處理器的效能。輝達在 CUDA 6.0 版本中提出了統一記憶體的方法,將中央處理器與圖形處理器之記憶體視為同一個記憶體空間。這項技術減輕了程式設計師的負擔,讓使用者不必再人工管理繁雜的記憶體搬運。然而統一記憶體可能造成效能低下;即使輝達提供了進階應用程式介面,讓程式設計師可以傳遞記憶體建議給統一記憶體驅動程式,如何選擇正確的記憶體建議仍相當困難。在這篇論文中,我們提出了一個輕量化的自動調節框架,用以找出各個應用程式最合適的統一記憶體建議。實驗結果顯示,我們的方法最高可以提升圖形處理器 12 倍的總體運算效能。 | zh_TW |
| dc.description.abstract | A Graphics Processing Unit (GPU) is a highly parallel computing device that processes large blocks of data more efficiently than a Central Processing Unit (CPU). However, the heterogeneous memory architecture between the GPU and the CPU can lead to significant data-movement overhead. In CUDA (Compute Unified Device Architecture) 6.0, NVIDIA introduced Unified Memory (UM), under which CPU and GPU memory is treated as a single address space. This technique relieves programmers of manually managing tedious data movement between the CPU and the GPU. However, UM can also cause severe performance degradation, and although NVIDIA provides advanced APIs for passing memory hints to the UM driver, choosing the right UM advice remains difficult. In this thesis, we propose a lightweight auto-tuning framework that finds the optimal UM advice for each application. Results show that our approach achieves up to a 12x overall GPU speedup on the nw application. | en |
| dc.description.provenance | Made available in DSpace on 2023-03-20T00:17:39Z (GMT). No. of bitstreams: 1 U0001-2207202213470700.pdf: 5675250 bytes, checksum: ccf851c2c8630c39a1cc9537571415f8 (MD5) Previous issue date: 2022 | en |
| dc.description.tableofcontents | 誌謝 i 摘要 ii Abstract iii Contents iv List of Figures vi List of Tables vii Chapter 1 Introduction 1 Chapter 2 Background 6 2.1 GPU Architecture 6 2.2 Compute Unified Device Architecture 6 2.3 Explicit Data Movement and Unified Memory 7 2.4 UM Advanced API 8 2.5 Auto-tuning 10 2.6 LLVM 11 Chapter 3 Related Works 14 Chapter 4 Proposed Method 18 4.1 Design Overview 18 4.2 UM Manipulator 19 4.2.1 Source Translation 19 4.2.2 LLVM Transformation 21 4.3 UM Tuner 22 Chapter 5 Experiments and Results 24 5.1 Experiment Settings 24 5.2 Benchmark 24 5.2.1 The Rodinia Benchmark Suite 24 5.2.2 Other Benchmark Suites 26 5.3 Tuning Parameters 27 5.4 GPU Activity Profiling 27 5.5 Performance Improvement 29 5.6 Oversubscription 31 5.7 cudaMemAdvise and cudaMemPrefetchAsync 33 5.8 Tuning Convergence 33 5.9 Discussion 34 Chapter 6 Conclusion and Future Works 36 References 38 — List of Figures: 2.1 Three-phase Design 12 4.1 Design Overview 19 4.2 LLVM Code Sample 22 5.1 GPU Activity Profiling 28 5.2 Kernel Execution Speedup 29 5.3 Memory Movement Time 30 5.4 Overall Speedup 31 5.5 gesummv Execution Time in Different Input Sizes 32 5.6 Configuration Comparison 32 5.7 nw Tuning Performance 33 5.8 gesummv Tuning Performance 34 — List of Tables: 5.1 Input Arguments for Each Application 27 5.2 Tuning Space 27 | |
| dc.language.iso | en | |
| dc.subject | 編譯器 | zh_TW |
| dc.subject | 自動調節 | zh_TW |
| dc.subject | 圖形處理器 | zh_TW |
| dc.subject | 異構計算 | zh_TW |
| dc.subject | 統一記憶體架構 | zh_TW |
| dc.subject | GPU | en |
| dc.subject | heterogeneous computing | en |
| dc.subject | auto-tuning | en |
| dc.subject | compiler | en |
| dc.subject | unified memory | en |
| dc.title | 統一計算架構自動調諧框架 | zh_TW |
| dc.title | Auto-Tuning Framework for CUDA Unified Memory | en |
| dc.type | Thesis | |
| dc.date.schoolyear | 110-2 | |
| dc.description.degree | 碩士 | |
| dc.contributor.oralexamcommittee | 顏嗣鈞(Hsu-Chun Yen),雷欽隆(Chin-Laung Lei),陳英一(Ing-Yi Chen),袁世一(Shih-Yi Yuan) | |
| dc.subject.keyword | 異構計算,圖形處理器,自動調節,編譯器,統一記憶體架構 | zh_TW |
| dc.subject.keyword | heterogeneous computing,GPU,auto-tuning,compiler,unified memory | en |
| dc.relation.page | 43 | |
| dc.identifier.doi | 10.6342/NTU202201635 | |
| dc.rights.note | 同意授權(全球公開) | |
| dc.date.accepted | 2022-07-26 | |
| dc.contributor.author-college | 電機資訊學院 | zh_TW |
| dc.contributor.author-dept | 電機工程學研究所 | zh_TW |
| dc.date.embargo-lift | 2022-08-10 | - |
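The abstract describes tuning the advanced UM hint API (`cudaMemAdvise` / `cudaMemPrefetchAsync`, also named in the table of contents). A minimal CUDA sketch of that tuning surface is below; the kernel, sizes, and the particular advice values chosen are illustrative assumptions, not taken from the thesis:

```cuda
// Sketch of the Unified Memory advice knobs an auto-tuner can vary.
// Assumption: a trivial kernel and one managed allocation stand in for a
// real application; the thesis searches over such hints per allocation.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));  // one address space for CPU and GPU
    for (int i = 0; i < n; ++i) x[i] = 1.0f;   // first touch on the CPU

    int dev = 0;
    cudaGetDevice(&dev);
    // Tunable hints: each allocation may instead use
    // cudaMemAdviseSetPreferredLocation, cudaMemAdviseSetAccessedBy, or no advice.
    cudaMemAdvise(x, n * sizeof(float), cudaMemAdviseSetReadMostly, dev);
    cudaMemPrefetchAsync(x, n * sizeof(float), dev);  // stage pages on the GPU up front

    scale<<<(n + 255) / 256, 256>>>(x, n);
    cudaDeviceSynchronize();
    printf("x[0] = %f\n", x[0]);
    cudaFree(x);
    return 0;
}
```

With a fixed set of hint choices per allocation like this, an auto-tuner can treat the advice values as a small discrete search space and measure end-to-end runtime for each configuration.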
| Appears in Collections: | 電機工程學系 |
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| U0001-2207202213470700.pdf | 5.54 MB | Adobe PDF | View/Open |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
