Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/78297

Full metadata record

| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 郭大維(Tei-Wei Kuo) | |
| dc.contributor.author | Yu-Chen Wu | en |
| dc.contributor.author | 吳宇宸 | zh_TW |
| dc.date.accessioned | 2021-07-11T14:49:58Z | - |
| dc.date.available | 2025-08-08 | |
| dc.date.copyright | 2020-09-14 | |
| dc.date.issued | 2020 | |
| dc.date.submitted | 2020-08-10 | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/78297 | - |
| dc.description.abstract | 異質計算提供了Big Data和人工智慧在效能、成本和功耗上很大的改善空間。各種加速器被設計出與通用的中央處理器(CPU)協同處理大量的資料。然而各硬體架構的限制與軟體的設計使得效能瓶頸依然存在。本論文分析加速器的低使用率議題與提出解決方案,以求更好地利用加速器於大數據分析。首先我們探討即時系統中同步協定對於加速器使用率的影響並提出改善方案來提高其使用率,同時我們保證即時系統的性質。第二部份我們探討大數據分析中,演算法無法妥善利用圖形處理器(GPU)的問題。我們以經典的頻繁樣式探勘演算法-FP-growth當作研究案例,提出了適合GPU的資料結構以及演算法,藉此消除大量記憶體配置的開銷。最後,我們進一步探討多GPU系統中,在考量GPU的拓樸下如何有效地使用GPU。我們針對多工環境下的深度學習訓練,提出共享多GPU系統的排程機制以達到最小化平均工作完成時間的目的。本論文中的解決方案經由實驗與分析,均證實了對於聲稱的目標有顯著的效果。 | zh_TW |
| dc.description.abstract | Heterogeneous computing provides tremendous opportunities for performance, cost, and energy optimization in Big Data and Artificial Intelligence applications. Various accelerators, such as GPUs, and specialized hardware architectures are designed to work together with general-purpose CPUs in large-scale data processing. However, inevitable processing bottlenecks remain between hardware components, due to architectural constraints and the designs and behaviors of applications. This dissertation addresses these utilization issues and presents solutions that better utilize accelerators in large-scale data processing. First, we exploit synchronization protocols for accelerators to improve accelerator utilization while guaranteeing the real-time requirements of the system. In the second part of the dissertation, we explore the GPU-utilization problems of the algorithms behind Big Data processing. Taking the classical FP-growth frequent pattern mining algorithm as a case study, we propose a GPU-friendly algorithm that transforms recursive function calls into iterative ones and minimizes massive dynamic memory allocations. In the third part, we further explore the GPU topology of servers and how effectively applications can utilize GPUs. We present a scheduling policy for users sharing GPU-powered servers for deep learning workloads, with the objective of minimizing the average job completion time. The proposed solutions were all verified by experiments and/or analysis, showing their effectiveness in resolving each identified problem. | en |
| dc.description.provenance | Made available in DSpace on 2021-07-11T14:49:58Z (GMT). No. of bitstreams: 1 U0001-0808202017441400.pdf: 2764857 bytes, checksum: 50d4526f7072eb980ff80e6d916feddb (MD5) Previous issue date: 2020 | en |
| dc.description.tableofcontents | Acknowledgements iii; Abstract (Chinese) v; Abstract vii; 1 Introduction 1; 1.1 Introduction 1; 1.2 Background and Related Work 2; 1.2.1 Synchronization Protocols for Heterogeneous Computing 2; 1.2.2 GPU-accelerated Frequent Pattern Mining Algorithms 3; 1.2.3 GPU Scheduling for Deep Learning Training Systems 5; 1.3 Objectives and Contributions 6; 2 Accelerator-Aware Task Synchronization for Real-Time Systems 9; 2.1 Background and Motivation 9; 2.2 Accelerator-Aware Task Synchronization Protocol 13; 2.2.1 Overview 13; 2.2.2 Accelerator Locking with Priority Bars 14; 2.2.3 Semaphore Locking with Priority Ceilings 16; 2.2.4 Tradeoff between Task Blocking and Accelerator Utilization 19; 2.3 Experiments 22; 2.3.1 Experimental Setup 22; 2.3.2 Simulation Results 22; 2.4 Summary 26; 3 Fast Frequent Pattern Mining without Candidate Generations on GPU by Low Latency Memory Allocation 27; 3.1 Background and Motivation 27; 3.1.1 The FP-growth Algorithm and its Variation 29; 3.1.2 Challenges of Accelerating FP-growth with GPUs 31; 3.2 Methodology 34; 3.2.1 FP-tree Reorganization 36; 3.2.2 Iterative Algorithm and Collective Memory Allocation 39; 3.2.3 The Design of Header Tables 42; 3.3 Performance Evaluation 46; 3.3.1 Experimental Setup 46; 3.3.2 Experimental Results 48; 3.3.3 Discussion of Overheads 49; 3.4 Summary 52; 4 Cybertron: A Topology-Aware GPU Scheduler for DNN Training Systems with Consideration of Cost-effectiveness 55; 4.1 Background and Motivation 55; 4.1.1 The Characteristics of Deep Learning Training (DLT) Jobs 55; 4.1.2 The Evolution of Systems for Deep Learning Training 56; 4.1.3 Ring-based Collective Communication 57; 4.1.4 Motivation 57; 4.1.5 Challenges 58; 4.2 The Design of Cybertron 61; 4.2.1 Overview 61; 4.2.2 Profiling 61; 4.2.3 Scheduling 62; 4.2.4 Placement 67; 4.3 Evaluation 72; 4.3.1 Implementation 72; 4.3.2 Setup 73; 4.3.3 The Results on the Real Server 74; 4.3.4 The Results of Simulation 81; 4.4 Summary 81; 5 Conclusion 83; Bibliography 85; Publication List 93 | |
| dc.language.iso | en | |
| dc.subject | 資料探勘 | zh_TW |
| dc.subject | 深度學習 | zh_TW |
| dc.subject | 異質計算 | zh_TW |
| dc.subject | 圖形處理器 | zh_TW |
| dc.subject | 即時系統 | zh_TW |
| dc.subject | deep learning | en |
| dc.subject | Real-time system | en |
| dc.subject | GPU | en |
| dc.subject | data mining | en |
| dc.subject | heterogeneous computing | en |
| dc.title | 針對巨量資料分析的異質計算提升加速器的使用率與吞吐量 | zh_TW |
| dc.title | Increasing Utilization and Throughput of Accelerators in Heterogeneous Computing for Big Data Analytics | en |
| dc.type | Thesis | |
| dc.date.schoolyear | 108-2 | |
| dc.description.degree | Doctoral | |
| dc.contributor.coadvisor | 葉彌妍(Mi-Yen Yeh) | |
| dc.contributor.oralexamcommittee | 楊佳玲(Chia-Lin Yang),林守德(Shou-De Lin),施吉昇(Chi-Sheng Shih),洪士灝(Shih-Hao Hung) | |
| dc.subject.keyword | 即時系統,圖形處理器,資料探勘,異質計算,深度學習, | zh_TW |
| dc.subject.keyword | Real-time system,GPU,data mining,heterogeneous computing,deep learning, | en |
| dc.relation.page | 93 | |
| dc.identifier.doi | 10.6342/NTU202002686 | |
| dc.rights.note | Authorized with compensation | |
| dc.date.accepted | 2020-08-10 | |
| dc.contributor.author-college | 電機資訊學院 | zh_TW |
| dc.contributor.author-dept | 資訊工程學研究所 | zh_TW |
| dc.date.embargo-lift | 2025-08-08 | - |
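The abstract above mentions making FP-growth GPU-friendly by transforming recursive function calls into iterative ones. A minimal sketch of that general recursive-to-iterative transformation on a prefix-tree traversal is shown below; all names and the tree layout here are illustrative assumptions, not the dissertation's actual code or data structures.

```python
# Sketch: replacing a recursive pattern-growth traversal with an
# explicit-stack iteration. Recursion maps poorly onto GPUs, so the
# abstract's approach flattens it; an explicit stack could also live
# in pre-allocated memory, avoiding per-call dynamic allocation.
# Node, mine_recursive, and mine_iterative are hypothetical names.

class Node:
    def __init__(self, item, count, children=None):
        self.item = item          # item label at this tree node
        self.count = count        # support count of the prefix path
        self.children = children or []

def mine_recursive(node, prefix, minsup, out):
    """Baseline: depth-first recursive pattern growth."""
    for child in node.children:
        if child.count >= minsup:
            pattern = prefix + [child.item]
            out.append((tuple(pattern), child.count))
            mine_recursive(child, pattern, minsup, out)

def mine_iterative(root, minsup):
    """Same preorder traversal driven by an explicit stack:
    no call recursion, identical output order."""
    out = []
    # Push children in reverse so pops occur in original order.
    stack = [(c, [c.item]) for c in reversed(root.children)]
    while stack:
        node, pattern = stack.pop()
        if node.count < minsup:
            continue  # prune: sub-threshold subtree is never expanded
        out.append((tuple(pattern), node.count))
        for child in reversed(node.children):
            stack.append((child, pattern + [child.item]))
    return out
```

Both functions enumerate the same frequent prefixes in the same order; only the control structure differs, which is the point of the transformation.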
| Appears in Collections: | Department of Computer Science and Information Engineering | |
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| U0001-0808202017441400.pdf (restricted access) | 2.7 MB | Adobe PDF | |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
