Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92985
Full metadata record (DC field: value [language])
dc.contributor.advisor: 劉邦鋒 [zh_TW]
dc.contributor.advisor: Pangfeng Liu [en]
dc.contributor.author: 吳秉柔 [zh_TW]
dc.contributor.author: Bing-Jou Wu [en]
dc.date.accessioned: 2024-07-12T16:08:15Z
dc.date.available: 2024-07-13
dc.date.copyright: 2024-07-12
dc.date.issued: 2024
dc.date.submitted: 2024-07-07
dc.identifier.citation: References
Y. Akhremtsev, P. Sanders, and C. Schulz. High-quality shared-memory graph partitioning. IEEE Transactions on Parallel and Distributed Systems, 31(11):2710–2722, Nov 2020.
M. D. Atkinson, J.-R. Sack, N. Santoro, and T. Strothotte. Min-max heaps and generalized priority queues. Commun. ACM, 29(10):996–1000, Oct 1986.
S. Fan, Y. Rong, C. Meng, Z. Cao, S. Wang, Z. Zheng, C. Wu, G. Long, J. Yang, L. Xia, L. Diao, X. Liu, and W. Lin. DAPPLE: A pipelined data parallel approach for training large models. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’21, pages 431–445, New York, NY, USA, 2021. Association for Computing Machinery.
W. Gong, X. Lai, Q. Yao, and S. Zheng. A variable neighborhood search algorithm for the balanced k-way partitioning problem with weight constraints. In 2021 China Automation Congress (CAC), pages 1106–1110, Oct 2021.
M. Hasanzadeh Mofrad, R. Melhem, and M. Hammoud. Revolver: Vertex-centric graph partitioning using reinforcement learning. In 2018 IEEE 11th International Conference on Cloud Computing (CLOUD), pages 818–821, July 2018.
B. Hendrickson and R. Leland. A multi-level algorithm for partitioning graphs. In Supercomputing ’95: Proceedings of the 1995 ACM/IEEE Conference on Supercomputing, pages 28–28, Dec 1995.
J. Herrmann, J. Kho, B. Uçar, K. Kaya, and U. V. Çatalyürek. Acyclic partitioning of large directed acyclic graphs. In 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pages 371–380, May 2017.
J. Herrmann, M. Y. Özkaya, B. Uçar, K. Kaya, and U. V. Çatalyürek. Multilevel algorithms for acyclic partitioning of directed acyclic graphs. SIAM Journal on Scientific Computing, 41(4):A2117–A2145, 2019.
A. B. Kahn. Topological sorting of large networks. Commun. ACM, 5(11):558–562, Nov 1962.
A. Kolesnikov, L. Beyer, X. Zhai, J. Puigcerver, J. Yung, S. Gelly, and N. Houlsby. Big Transfer (BiT): General visual representation learning. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V, pages 491–507, Berlin, Heidelberg, 2020. Springer-Verlag.
Y. Li, W. Yang, Z. Zhong, and Y. Xu. Nolgp: A novel optimized large-scale graph partitioning algorithm. In 2019 15th International Conference on Computational Intelligence and Security (CIS), pages 127–131, Dec 2019.
P. Liang, Y. Tang, X. Zhang, Y. Bai, T. Su, Z. Lai, L. Qiao, and D. Li. A survey on auto-parallelism of large-scale deep learning training. IEEE Transactions on Parallel and Distributed Systems, 34(8):2377–2390, Aug 2023.
W. Liu, Z. Lai, S. Li, Y. Duan, K. Ge, and D. Li. Autopipe: A fast pipeline parallelism approach with balanced partitioning and micro-batch slicing. In 2022 IEEE International Conference on Cluster Computing (CLUSTER), pages 301–312, Sep. 2022.
V. Mokashi and D. B. Kulkarni. A review: Scalable parallel graph partitioning for complex networks. In 2018 Second International Conference on Intelligent Computing and Control Systems (ICICCS), pages 1869–1871, June 2018.
O. Moreira, M. Popp, and C. Schulz. Evolutionary multi-level acyclic graph partitioning. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO ’18, pages 332–339, New York, NY, USA, 2018. Association for Computing Machinery.
D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N. R. Devanur, G. R. Ganger, P. B. Gibbons, and M. Zaharia. PipeDream: Generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP ’19, pages 1–15, New York, NY, USA, 2019. Association for Computing Machinery.
D. Patel and G. Wong. GPT-4 architecture, infrastructure, training dataset, costs, vision, MoE. https://www.semianalysis.com/p/gpt-4-architecture-infrastructure, 2023. Accessed: April 26, 2024.
A. S. Pope, D. R. Tauritz, and A. D. Kent. Evolving multi-level graph partitioning algorithms. In 2016 IEEE Symposium Series on Computational Intelligence (SSCI), pages 1–8, Dec 2016.
M. Tanaka, K. Taura, T. Hanawa, and K. Torisawa. Automatic graph partitioning for very large-scale deep learning. In 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 1004–1013, May 2021.
B. Wu, Z. Xiao, P. Lin, Z. Tang, and K. Li. Critical path awareness techniques for large-scale graph partitioning. IEEE Transactions on Sustainable Computing, 8(3):412–422, July 2023.
U. V. Çatalyürek, M. Deveci, K. Kaya, and B. Uçar. Multithreaded clustering for multi-level hypergraph partitioning. In 2012 IEEE 26th International Parallel and Distributed Processing Symposium, pages 848–859, May 2012.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92985
dc.description.abstract [zh_TW]:
As neural network models grow ever larger, their demands on training time and memory grow with them. Meeting these demands makes advanced parallel computing techniques essential. My research focuses on hybrid parallelism, an approach that extends pipeline parallelism. Pipeline parallelism splits a neural network into several sub-networks and distributes them across multiple processing units, so that each device can work on a different segment of the data at the same time. Hybrid parallelism broadens this idea by assigning multiple devices to each sub-network.
My research aims to optimize hybrid parallelism by improving how the model is split and how computing devices are allocated. I first abstract the neural network as a directed acyclic graph of tensor operators, and then prove that optimally partitioning such a graph is NP-complete. I therefore propose a two-step method: first determine an ordering of the nodes, and then partition this sequence with dynamic programming so that the load remains balanced across devices.
In transforming the graph into a node sequence, I explore two main methods: one based on topological sorting, the other on clustering non-contiguous subgraphs. I implement and compare both in order to choose the better-performing one.
Experiments show that my algorithm substantially improves partitioning efficiency and training throughput, achieving a 23x speedup in partitioning time and a 1.3x increase in training throughput.
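The first step described in the abstract, turning the operator graph into a node sequence, can be done with a standard topological sort; the reference list above cites Kahn's 1962 algorithm, which the Python sketch below illustrates. The operator names, edge list, and plain dict-based graph representation are illustrative assumptions, not taken from the thesis.

from collections import deque

def topological_order(nodes, edges):
    # Kahn's algorithm: return one valid topological ordering of a DAG.
    # nodes: hashable node ids (e.g. tensor-operator names).
    # edges: (u, v) pairs meaning operator u must run before operator v.
    nodes = list(nodes)
    succ = {n: [] for n in nodes}    # adjacency list
    indeg = {n: 0 for n in nodes}    # number of unmet dependencies
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1

    ready = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)

    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")
    return order

# Toy operator graph with a skip connection: conv -> relu -> add, plus conv -> add.
print(topological_order(["conv", "relu", "add"],
                        [("conv", "relu"), ("relu", "add"), ("conv", "add")]))

Because every dependency points from an earlier to a later position in the resulting order, any split of that order into contiguous stages respects the operator dependencies, which is what makes the sequence-partitioning step described in the abstract well defined.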
dc.description.abstract [en]:
As neural network models become gigantic, they increasingly demand more time and memory for training. To meet these demands, advanced parallel computing techniques have become essential. This research focuses on hybrid parallelism, an extension of pipeline parallelism. Pipeline parallelism splits the neural network into sub-networks distributed across a sequence of processing units, enabling simultaneous processing of different data segments on each device. Hybrid parallelism extends this concept by allocating multiple devices to each sub-network.
This research optimizes hybrid parallelism by improving how the model is partitioned and how computational devices are assigned. I address these issues by modeling the neural network as a directed acyclic graph of tensor operators and then demonstrating that optimally partitioning this graph is NP-complete. I therefore propose a two-step approach: the first step determines a sequence of nodes, and the second step uses dynamic programming to partition that sequence so that the load stays balanced across the assigned devices.
In transforming the graph into a sequence, I explore two methods: one employs topological sorting, while the other clusters non-sequential subgraphs. I implement both methods and select the more effective one based on performance.
I implement my algorithm and conduct experiments. The results show substantial improvements in both partitioning speed and training throughput, with speedups of up to 23x in partitioning time and a 1.3x increase in training throughput.
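To make the second step concrete, here is a minimal Python sketch of a dynamic program that splits a cost-annotated operator sequence into contiguous, load-balanced stages. The per-operator costs, the number of devices, and the omission of communication cost and per-stage device assignment are simplifying assumptions for illustration; this is not the thesis's actual hybrid-parallel partitioning algorithm.

from itertools import accumulate

def balanced_partition(costs, k):
    # Split a sequence of per-operator costs into k contiguous stages,
    # minimizing the cost of the most loaded stage (classic linear-partition DP).
    # Returns (bottleneck_cost, stages), each stage being a (start, end) index range.
    n = len(costs)
    prefix = [0] + list(accumulate(costs))

    def seg(i, j):                        # total cost of costs[i:j]
        return prefix[j] - prefix[i]

    INF = float("inf")
    # dp[j][p]: best bottleneck when the first j operators form p stages.
    dp = [[INF] * (k + 1) for _ in range(n + 1)]
    cut = [[0] * (k + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for p in range(1, k + 1):
        for j in range(p, n + 1):
            for i in range(p - 1, j):     # stage p covers costs[i:j]
                cand = max(dp[i][p - 1], seg(i, j))
                if cand < dp[j][p]:
                    dp[j][p], cut[j][p] = cand, i

    stages, j = [], n                     # recover stage boundaries from the cut table
    for p in range(k, 0, -1):
        i = cut[j][p]
        stages.append((i, j))
        j = i
    return dp[n][k], stages[::-1]

# Toy per-operator costs (e.g. profiled milliseconds) split across 3 devices.
print(balanced_partition([4, 2, 7, 1, 3, 5, 2], 3))

The triple loop makes this sketch O(k·n²); it is meant only to state the recurrence, not to reflect the partitioning speed reported in the experiments.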
dc.description.provenance [en]: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-07-12T16:08:15Z. No. of bitstreams: 0
dc.description.provenance [en]: Made available in DSpace on 2024-07-12T16:08:15Z (GMT). No. of bitstreams: 0
dc.description.tableofcontents:
Verification Letter from the Oral Examination Committee i
摘要 (Chinese Abstract) ii
Abstract iii
Contents v
List of Figures vii
Chapter 1 Introduction 1
Chapter 2 Related Works 3
2.1 Auto-Parallelism 3
2.1.1 Parallelism Schemes 3
2.1.2 Device Topology 4
2.1.3 Performance Evaluation Methods 4
2.1.4 Strategy Searching Methods 4
2.2 Hybrid Parallelism 5
2.3 Graph Partitioning 6
Chapter 3 Background 8
Chapter 4 Graph Partition Problem 11
4.1 Problem Definition 11
4.2 NP-Completeness 11
4.3 Transforming Graph into Sequence 12
4.3.1 Subgraph Contraction 12
4.3.2 Topological Sorting 13
4.4 Hybrid Parallel Partition 13
Chapter 5 Experiment 19
5.1 Experimental Setup 19
5.1.1 Implementation Details 19
5.1.2 Hardware Configuration 19
5.1.3 Baseline 19
5.1.4 Training Settings 20
5.1.5 Profiling and Preprocessing 21
5.1.6 Models 21
5.2 Analysis of Throughput 23
5.2.1 Granularity of Partitioning 23
5.2.2 Communication Time 23
5.3 Analysis of Partition Times 25
Chapter 6 Conclusion 28
References 29
dc.language.iso: en
dc.subject: 動態規劃 [zh_TW]
dc.subject: 管線平行訓練 [zh_TW]
dc.subject: 機器學習 [zh_TW]
dc.subject: 深度神經網路 [zh_TW]
dc.subject: 圖分割 [zh_TW]
dc.subject: graph partitioning [en]
dc.subject: deep neural network [en]
dc.subject: machine learning [en]
dc.subject: dynamic programming [en]
dc.subject: pipeline parallel [en]
dc.title: 多圖形處理器環境下優化深度學習網絡的管線平行訓練 [zh_TW]
dc.title: Execution Time Optimization for Pipeline Deep Network Training on Multiple GPUs [en]
dc.type: Thesis
dc.date.schoolyear: 112-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 洪鼎詠; 吳真貞 [zh_TW]
dc.contributor.oralexamcommittee: Ding-Yong Hong; Jan-Jan Wu [en]
dc.subject.keyword: 管線平行訓練, 動態規劃, 圖分割, 深度神經網路, 機器學習 [zh_TW]
dc.subject.keyword: pipeline parallel, dynamic programming, graph partitioning, deep neural network, machine learning [en]
dc.relation.page: 32
dc.identifier.doi: 10.6342/NTU202401090
dc.rights.note: 未授權 (not authorized for public access)
dc.date.accepted: 2024-07-08
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 資訊工程學系 (Department of Computer Science and Information Engineering)
Appears in Collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in this item:
File: ntu-112-2.pdf (restricted access, not authorized for public access)
Size: 1.2 MB
Format: Adobe PDF


