  1. NTU Theses and Dissertations Repository
  2. College of Electrical Engineering and Computer Science
  3. Department of Electrical Engineering
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/86785
Full metadata record (DC field: value [language]):
dc.contributor.advisor: 郭斯彥 (Sy-Yen Kuo)
dc.contributor.author: Yu-Wei Chu [en]
dc.contributor.author: 朱祐葳 [zh_TW]
dc.date.accessioned: 2023-03-20T00:17:39Z
dc.date.copyright: 2022-08-10
dc.date.issued: 2022
dc.date.submitted: 2022-07-25
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/86785
dc.description.abstract: 圖形處理器為一種可高度平行運算之計算設備,相較於中央處理器,能更有效率地平行處理大量資料。然而,圖形處理器與中央處理器間的異質記憶體架構造成資料搬運成本,降低了圖形處理器的效能。輝達在 CUDA 6.0 中提出了統一記憶體技術,將中央處理器與圖形處理器的記憶體視為同一個記憶體空間。這項技術減輕了程式設計師的負擔,使其不必再人工管理繁雜的記憶體搬運。然而,統一記憶體可能造成效能低落;即使輝達提供了進階應用程式介面,讓程式設計師可以傳遞記憶體建議給統一記憶體驅動程式,如何選擇正確的記憶體建議仍相當困難。在這篇論文中,我們提出了一個輕量化的自動調校框架,用以找出各應用程式最合適的統一記憶體建議。實驗結果顯示,我們的方法最高可提升圖形處理器 12 倍的總體運算效能。[zh_TW]
dc.description.abstract: A Graphics Processing Unit (GPU) is a highly parallel computing device that processes large blocks of data more efficiently than a Central Processing Unit (CPU). However, the heterogeneity of memory between the GPU and the CPU can lead to significant data-movement overhead. In CUDA (Compute Unified Device Architecture) 6.0, NVIDIA introduced Unified Memory (UM), under which CPU and GPU memory are treated as a single memory space. This technique relieves programmers of the burden of manually managing tedious data movement between the CPU and the GPU. However, UM can cause severe performance degradation. Although NVIDIA provides advanced APIs that let programmers pass memory hints to the UM driver, choosing the right UM advice remains difficult. In this thesis, we propose a lightweight auto-tuning framework that finds the best UM advice for each application. Results show that our approach achieves up to a 12x speedup in overall GPU performance on the nw application. [en]
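The search the abstract describes — trying combinations of UM memory hints per application and keeping the fastest — can be sketched as a brute-force tuner. This is an illustrative sketch only, not the thesis's actual framework: the advice names loosely mirror the CUDA `cudaMemAdvise` flags (`cudaMemAdviseSetReadMostly`, `cudaMemAdviseSetPreferredLocation`, `cudaMemAdviseSetAccessedBy`) plus an optional `cudaMemPrefetchAsync`, while `tune`, `fake_run`, and the allocation names are hypothetical stand-ins.

```python
import itertools
from typing import Callable, Dict, List, Tuple

# Hypothetical per-allocation tuning space, modeled on the UM advice APIs:
# one cudaMemAdvise-style flag plus an optional prefetch before launch.
ADVICE_FLAGS = ["none", "read_mostly", "preferred_location", "accessed_by"]
PREFETCH = [False, True]

def tune(allocations: List[str],
         run_config: Callable[[Dict], float]) -> Tuple[Dict, float]:
    """Exhaustively search advice/prefetch settings for each allocation.

    `run_config` runs the target application under one configuration and
    returns its wall-clock time; in a real tuner it would launch and time
    the instrumented CUDA binary.
    """
    per_alloc = list(itertools.product(ADVICE_FLAGS, PREFETCH))
    best_cfg, best_time = None, float("inf")
    for combo in itertools.product(per_alloc, repeat=len(allocations)):
        cfg = {a: {"advice": adv, "prefetch": pf}
               for a, (adv, pf) in zip(allocations, combo)}
        t = run_config(cfg)
        if t < best_time:       # keep the fastest configuration seen so far
            best_cfg, best_time = cfg, t
    return best_cfg, best_time

def fake_run(cfg: Dict) -> float:
    """Stand-in timing model for demonstration (not real measurements)."""
    t = 10.0
    if cfg["input"]["advice"] == "read_mostly":
        t -= 3.0
    if cfg["input"]["prefetch"]:
        t -= 2.0
    if cfg["output"]["advice"] == "preferred_location":
        t -= 1.0
    return t

best_cfg, best_time = tune(["input", "output"], fake_run)
```

Exhaustive search is feasible here because the advice space per allocation is tiny (8 settings); with many allocations a real tuner would prune or sample the space instead.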
dc.description.provenance: Made available in DSpace on 2023-03-20T00:17:39Z (GMT). No. of bitstreams: 1; U0001-2207202213470700.pdf: 5675250 bytes, checksum: ccf851c2c8630c39a1cc9537571415f8 (MD5). Previous issue date: 2022 [en]
dc.description.tableofcontents:
誌謝 i
摘要 ii
Abstract iii
Contents iv
List of Figures vi
List of Tables vii
Chapter 1 Introduction 1
Chapter 2 Background 6
  2.1 GPU Architecture 6
  2.2 Compute Unified Device Architecture 6
  2.3 Explicit Data Movement and Unified Memory 7
  2.4 UM Advanced API 8
  2.5 Auto-tuning 10
  2.6 LLVM 11
Chapter 3 Related Works 14
Chapter 4 Proposed Method 18
  4.1 Design Overview 18
  4.2 UM Manipulator 19
    4.2.1 Source Translation 19
    4.2.2 LLVM Transformation 21
  4.3 UM Tuner 22
Chapter 5 Experiments and Results 24
  5.1 Experiment Settings 24
  5.2 Benchmark 24
    5.2.1 The Rodinia Benchmark Suite 24
    5.2.2 Other Benchmark Suite 26
  5.3 Tuning Parameters 27
  5.4 GPU Activity Profiling 27
  5.5 Performance Improvement 29
  5.6 Oversubscription 31
  5.7 cudaMemAdvise and cudaMemPrefetchAsync 33
  5.8 Tuning Convergence 33
  5.9 Discussion 34
Chapter 6 Conclusion and Future Works 36
References 38
List of Figures:
  2.1 Three-phase Design 12
  4.1 Design Overview 19
  4.2 LLVM Code Sample 22
  5.1 GPU Activity Profiling 28
  5.2 Kernel Execution Speedup 29
  5.3 Memory Movement Time 30
  5.4 Overall Speedup 31
  5.5 gesummv Execution Time in Different Input Size 32
  5.6 Configuration Comparison 32
  5.7 nw Tuning Performance 33
  5.8 gesummv Tuning Performance 34
List of Tables:
  5.1 Input argument for each application 27
  5.2 Tuning Space 27
dc.language.iso: en
dc.subject: 編譯器 [zh_TW]
dc.subject: 自動調節 [zh_TW]
dc.subject: 圖形處理器 [zh_TW]
dc.subject: 異構計算 [zh_TW]
dc.subject: 統一記憶體架構 [zh_TW]
dc.subject: GPU [en]
dc.subject: heterogeneous computing [en]
dc.subject: auto-tuning [en]
dc.subject: compiler [en]
dc.subject: unified memory [en]
dc.title: 統一計算架構自動調諧框架 [zh_TW]
dc.title: Auto-Tuning Framework for CUDA Unified Memory [en]
dc.type: Thesis
dc.date.schoolyear: 110-2
dc.description.degree: 碩士
dc.contributor.oralexamcommittee: 顏嗣鈞 (Hsu-Chun Yen), 雷欽隆 (Chin-Laung Lei), 陳英一 (Ing-Yi Chen), 袁世一 (Shih-Yi Yuan)
dc.subject.keyword: 異構計算, 圖形處理器, 自動調節, 編譯器, 統一記憶體架構 [zh_TW]
dc.subject.keyword: heterogeneous computing, GPU, auto-tuning, compiler, unified memory [en]
dc.relation.page: 43
dc.identifier.doi: 10.6342/NTU202201635
dc.rights.note: 同意授權(全球公開)
dc.date.accepted: 2022-07-26
dc.contributor.author-college: 電機資訊學院 [zh_TW]
dc.contributor.author-dept: 電機工程學研究所 [zh_TW]
dc.date.embargo-lift: 2022-08-10
Appears in Collections: Department of Electrical Engineering

Files in This Item:
File | Size | Format
U0001-2207202213470700.pdf | 5.54 MB | Adobe PDF
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated in their copyright terms.
