Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92985
| Title: | 多圖形處理器環境下優化深度學習網絡的管線平行訓練 Execution Time Optimization for Pipeline Deep Network Training on Multiple GPUs |
| Author: | 吳秉柔 Bing-Jou Wu |
| Advisor: | 劉邦鋒 Pangfeng Liu |
| Keywords: | pipeline parallel training, dynamic programming, graph partitioning, deep neural network, machine learning |
| Publication Year: | 2024 |
| Degree: | Master's |
| Abstract: | As neural network models grow larger, they demand increasingly more time and memory for training. To meet these demands, advanced parallel computing techniques have become essential. This research focuses on hybrid parallelism, an extension of pipeline parallelism. Pipeline parallelism splits the neural network into sub-networks distributed across a sequence of processing units, enabling each device to process a different data segment simultaneously. Hybrid parallelism extends this concept by allocating multiple devices to each sub-network. This research optimizes hybrid parallelism by improving how the model is partitioned and how computational devices are assigned. I model the neural network as a directed acyclic graph of tensor operators and show that optimally partitioning this graph is NP-complete. I then propose a two-step approach: the first step determines a sequence of nodes, and the second step uses dynamic programming to partition that sequence so that the load stays balanced across the assigned devices. In transforming the graph into a sequence, I explore two methods: one employs topological sorting, while the other clusters non-sequential subgraphs. I implement both and select the more effective one based on measured performance. Experiments show substantial improvements in both partitioning speed and training throughput, with speedups of up to 23x in partitioning time and a 1.3x increase in training throughput. |
| URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92985 |
| DOI: | 10.6342/NTU202401090 |
| Full-Text Authorization: | Not authorized |
| Appears in Collections: | Department of Computer Science and Information Engineering |
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-112-2.pdf (restricted, not publicly accessible) | 1.2 MB | Adobe PDF |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
