Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/86785

Full metadata record
| DC 欄位 | 值 | 語言 |
|---|---|---|
| dc.contributor.advisor | 郭斯彥(Sy-Yen Kuo) | |
| dc.contributor.author | Yu-Wei Chu | en |
| dc.contributor.author | 朱祐葳 | zh_TW |
| dc.date.accessioned | 2023-03-20T00:17:39Z | - |
| dc.date.copyright | 2022-08-10 | |
| dc.date.issued | 2022 | |
| dc.date.submitted | 2022-07-25 | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/86785 | - |
| dc.description.abstract | 圖形處理器為一種可高度平行運算之計算設備,相較於中央處理器,能更有效率地平行處理大量的資料。然而,圖形處理器與中央處理器間的異質記憶體架構造成資料搬運的成本,降低了圖形處理器的效能。輝達在 CUDA 6.0 版本中提出了統一記憶體的方法,將中央處理器與圖形處理器之記憶體視為同一個記憶體空間。這項技術減輕了程式設計師的負擔,讓使用者不必再人工管理繁雜的記憶體搬運。然而統一記憶體可能造成效能低下;即使輝達提供了進階應用程式介面,讓程式設計師可以傳遞記憶體建議給統一記憶體驅動程式,如何選擇正確的記憶體建議仍相當困難。在這篇論文中,我們提出了一個輕量化的自動調節框架,用以找出各個應用程式最合適的統一記憶體建議。實驗結果顯示,我們的方法最高可以提升圖形處理器 12 倍的總體運算效能。 | zh_TW |
| dc.description.abstract | A Graphics Processing Unit (GPU) is a highly parallel computing device that processes large blocks of data more efficiently than a Central Processing Unit (CPU). However, the heterogeneous memory architecture between the GPU and the CPU can lead to significant data-movement overhead. In CUDA (Compute Unified Device Architecture) 6.0, NVIDIA introduced Unified Memory (UM), under which CPU and GPU memory is treated as a single address space. This technique relieves programmers of manually managing tedious data movement between the CPU and the GPU. However, UM can also cause severe performance degradation, and although NVIDIA provides advanced APIs for passing memory hints to the UM driver, choosing the right UM advice remains difficult. In this thesis, we propose a lightweight auto-tuning framework that finds the optimal UM advice for each application. Results show that our approach achieves up to a 12x overall GPU speedup on the nw application. | en |
| dc.description.provenance | Made available in DSpace on 2023-03-20T00:17:39Z (GMT). No. of bitstreams: 1 U0001-2207202213470700.pdf: 5675250 bytes, checksum: ccf851c2c8630c39a1cc9537571415f8 (MD5) Previous issue date: 2022 | en |
| dc.description.tableofcontents | 誌謝 i 摘要 ii Abstract iii Contents iv List of Figures vi List of Tables vii Chapter 1 Introduction 1 Chapter 2 Background 6 2.1 GPU Architecture 6 2.2 Compute Unified Device Architecture 6 2.3 Explicit Data Movement and Unified Memory 7 2.4 UM Advanced API 8 2.5 Auto-tuning 10 2.6 LLVM 11 Chapter 3 Related Works 14 Chapter 4 Proposed Method 18 4.1 Design Overview 18 4.2 UM Manipulator 19 4.2.1 Source Translation 19 4.2.2 LLVM Transformation 21 4.3 UM Tuner 22 Chapter 5 Experiments and Results 24 5.1 Experiment Settings 24 5.2 Benchmark 24 5.2.1 The Rodinia Benchmark Suite 24 5.2.2 Other Benchmark Suites 26 5.3 Tuning Parameters 27 5.4 GPU Activity Profiling 27 5.5 Performance Improvement 29 5.6 Oversubscription 31 5.7 cudaMemAdvise and cudaMemPrefetchAsync 33 5.8 Tuning Convergence 33 5.9 Discussion 34 Chapter 6 Conclusion and Future Works 36 References 38 — List of Figures: 2.1 Three-phase Design 12 4.1 Design Overview 19 4.2 LLVM Code Sample 22 5.1 GPU Activity Profiling 28 5.2 Kernel Execution Speedup 29 5.3 Memory Movement Time 30 5.4 Overall Speedup 31 5.5 gesummv Execution Time in Different Input Sizes 32 5.6 Configuration Comparison 32 5.7 nw Tuning Performance 33 5.8 gesummv Tuning Performance 34 — List of Tables: 5.1 Input Arguments for Each Application 27 5.2 Tuning Space 27 | |
| dc.language.iso | en | |
| dc.subject | 編譯器 | zh_TW |
| dc.subject | 自動調節 | zh_TW |
| dc.subject | 圖形處理器 | zh_TW |
| dc.subject | 異構計算 | zh_TW |
| dc.subject | 統一記憶體架構 | zh_TW |
| dc.subject | GPU | en |
| dc.subject | heterogeneous computing | en |
| dc.subject | auto-tuning | en |
| dc.subject | compiler | en |
| dc.subject | unified memory | en |
| dc.title | 統一計算架構自動調諧框架 | zh_TW |
| dc.title | Auto-Tuning Framework for CUDA Unified Memory | en |
| dc.type | Thesis | |
| dc.date.schoolyear | 110-2 | |
| dc.description.degree | 碩士 | |
| dc.contributor.oralexamcommittee | 顏嗣鈞(Hsu-Chun Yen),雷欽隆(Chin-Laung Lei),陳英一(Ing-Yi Chen),袁世一(Shih-Yi Yuan) | |
| dc.subject.keyword | 異構計算,圖形處理器,自動調節,編譯器,統一記憶體架構 | zh_TW |
| dc.subject.keyword | heterogeneous computing,GPU,auto-tuning,compiler,unified memory | en |
| dc.relation.page | 43 | |
| dc.identifier.doi | 10.6342/NTU202201635 | |
| dc.rights.note | 同意授權(全球公開) | |
| dc.date.accepted | 2022-07-26 | |
| dc.contributor.author-college | 電機資訊學院 | zh_TW |
| dc.contributor.author-dept | 電機工程學研究所 | zh_TW |
| dc.date.embargo-lift | 2022-08-10 | - |
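The abstract describes tuning the advanced UM hint API (`cudaMemAdvise` / `cudaMemPrefetchAsync`, also named in the table of contents). A minimal CUDA sketch of that tuning surface is below; the kernel, sizes, and the particular advice values chosen are illustrative assumptions, not taken from the thesis:

```cuda
// Sketch of the Unified Memory advice knobs an auto-tuner can vary.
// Assumption: a trivial kernel and one managed allocation stand in for a
// real application; the thesis searches over such hints per allocation.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));  // one address space for CPU and GPU
    for (int i = 0; i < n; ++i) x[i] = 1.0f;   // first touch on the CPU

    int dev = 0;
    cudaGetDevice(&dev);
    // Tunable hints: each allocation may instead use
    // cudaMemAdviseSetPreferredLocation, cudaMemAdviseSetAccessedBy, or no advice.
    cudaMemAdvise(x, n * sizeof(float), cudaMemAdviseSetReadMostly, dev);
    cudaMemPrefetchAsync(x, n * sizeof(float), dev);  // stage pages on the GPU up front

    scale<<<(n + 255) / 256, 256>>>(x, n);
    cudaDeviceSynchronize();
    printf("x[0] = %f\n", x[0]);
    cudaFree(x);
    return 0;
}
```

With a fixed set of hint choices per allocation like this, an auto-tuner can treat the advice values as a small discrete search space and measure end-to-end runtime for each configuration.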
| Appears in Collections: | 電機工程學系 |
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| U0001-2207202213470700.pdf | 5.54 MB | Adobe PDF | View/Open |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
