Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92985
Full metadata record (DC field: value [language])
dc.contributor.advisor: 劉邦鋒 [zh_TW]
dc.contributor.advisor: Pangfeng Liu [en]
dc.contributor.author: 吳秉柔 [zh_TW]
dc.contributor.author: Bing-Jou Wu [en]
dc.date.accessioned: 2024-07-12T16:08:15Z
dc.date.available: 2024-07-13
dc.date.copyright: 2024-07-12
dc.date.issued: 2024
dc.date.submitted: 2024-07-07
dc.identifier.citation: References
Y. Akhremtsev, P. Sanders, and C. Schulz. High-quality shared-memory graph partitioning. IEEE Transactions on Parallel and Distributed Systems, 31(11):2710–2722, Nov 2020.
M. D. Atkinson, J.-R. Sack, N. Santoro, and T. Strothotte. Min-max heaps and generalized priority queues. Commun. ACM, 29(10):996–1000, Oct 1986.
S. Fan, Y. Rong, C. Meng, Z. Cao, S. Wang, Z. Zheng, C. Wu, G. Long, J. Yang, L. Xia, L. Diao, X. Liu, and W. Lin. DAPPLE: A pipelined data parallel approach for training large models. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’21, pages 431–445, New York, NY, USA, 2021. Association for Computing Machinery.
W. Gong, X. Lai, Q. Yao, and S. Zheng. A variable neighborhood search algorithm for the balanced k-way partitioning problem with weight constraints. In 2021 China Automation Congress (CAC), pages 1106–1110, Oct 2021.
M. Hasanzadeh Mofrad, R. Melhem, and M. Hammoud. Revolver: Vertex-centric graph partitioning using reinforcement learning. In 2018 IEEE 11th International Conference on Cloud Computing (CLOUD), pages 818–821, July 2018.
B. Hendrickson and R. Leland. A multi-level algorithm for partitioning graphs. In Supercomputing ’95: Proceedings of the 1995 ACM/IEEE Conference on Supercomputing, pages 28–28, Dec 1995.
J. Herrmann, J. Kho, B. Uçar, K. Kaya, and U. V. Çatalyürek. Acyclic partitioning of large directed acyclic graphs. In 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pages 371–380, May 2017.
J. Herrmann, M. Y. Özkaya, B. Uçar, K. Kaya, and U. V. Çatalyürek. Multilevel algorithms for acyclic partitioning of directed acyclic graphs. SIAM Journal on Scientific Computing, 41(4):A2117–A2145, 2019.
A. B. Kahn. Topological sorting of large networks. Commun. ACM, 5(11):558–562, Nov 1962.
A. Kolesnikov, L. Beyer, X. Zhai, J. Puigcerver, J. Yung, S. Gelly, and N. Houlsby. Big Transfer (BiT): General visual representation learning. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V, pages 491–507, Berlin, Heidelberg, 2020. Springer-Verlag.
Y. Li, W. Yang, Z. Zhong, and Y. Xu. Nolgp: A novel optimized large-scale graph partitioning algorithm. In 2019 15th International Conference on Computational Intelligence and Security (CIS), pages 127–131, Dec 2019.
P. Liang, Y. Tang, X. Zhang, Y. Bai, T. Su, Z. Lai, L. Qiao, and D. Li. A survey on auto-parallelism of large-scale deep learning training. IEEE Transactions on Parallel and Distributed Systems, 34(8):2377–2390, Aug 2023.
W. Liu, Z. Lai, S. Li, Y. Duan, K. Ge, and D. Li. Autopipe: A fast pipeline parallelism approach with balanced partitioning and micro-batch slicing. In 2022 IEEE International Conference on Cluster Computing (CLUSTER), pages 301–312, Sep. 2022.
V. Mokashi and D. B. Kulkarni. A review: Scalable parallel graph partitioning for complex networks. In 2018 Second International Conference on Intelligent Computing and Control Systems (ICICCS), pages 1869–1871, June 2018.
O. Moreira, M. Popp, and C. Schulz. Evolutionary multi-level acyclic graph partitioning. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO ’18, pages 332–339, New York, NY, USA, 2018. Association for Computing Machinery.
D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N. R. Devanur, G. R. Ganger, P. B. Gibbons, and M. Zaharia. PipeDream: Generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP ’19, pages 1–15, New York, NY, USA, 2019. Association for Computing Machinery.
D. Patel and G. Wong. GPT-4 architecture, infrastructure, training dataset, costs, vision, MoE. https://www.semianalysis.com/p/gpt-4-architecture-infrastructure, 2023. Accessed: April 26, 2024.
A. S. Pope, D. R. Tauritz, and A. D. Kent. Evolving multi-level graph partitioning algorithms. In 2016 IEEE Symposium Series on Computational Intelligence (SSCI), pages 1–8, Dec 2016.
M. Tanaka, K. Taura, T. Hanawa, and K. Torisawa. Automatic graph partitioning for very large-scale deep learning. In 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 1004–1013, May 2021.
B. Wu, Z. Xiao, P. Lin, Z. Tang, and K. Li. Critical path awareness techniques for large-scale graph partitioning. IEEE Transactions on Sustainable Computing, 8(3):412–422, July 2023.
U. V. Çatalyürek, M. Deveci, K. Kaya, and B. Uçar. Multithreaded clustering for multi-level hypergraph partitioning. In 2012 IEEE 26th International Parallel and Distributed Processing Symposium, pages 848–859, May 2012.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92985
dc.description.abstract [zh_TW]:
As neural network models grow ever larger, their demands on training time and memory grow with them. Meeting these demands makes advanced parallel computing techniques essential. My research focuses on hybrid parallelism, an approach that extends pipeline parallelism. Pipeline parallelism splits a neural network into several sub-networks and distributes them across multiple processing units, so that each device can work on a different segment of the data at the same time. Hybrid parallelism broadens this idea by assigning multiple devices to each sub-network.
My research aims to optimize hybrid parallelism by improving how the model is split and how computing devices are allocated. I first abstract the neural network as a directed acyclic graph of tensor operators, and then prove that optimally partitioning such a graph is NP-complete. I therefore propose a two-step method: first determine an ordering of the nodes, and then partition this sequence with dynamic programming so that the load remains balanced across devices.
In transforming the graph into a node sequence, I explore two main methods: one based on topological sorting, the other on clustering non-contiguous subgraphs. I implement and compare both in order to choose the better-performing one.
Experiments show that my algorithm substantially improves partitioning efficiency and training throughput, achieving a 23x speedup in partitioning time and a 1.3x increase in training throughput.
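The first step described in the abstract, turning the operator graph into a node sequence, can be done with a standard topological sort; the reference list above cites Kahn's 1962 algorithm, which the Python sketch below illustrates. The operator names, edge list, and plain dict-based graph representation are illustrative assumptions, not taken from the thesis.

from collections import deque

def topological_order(nodes, edges):
    # Kahn's algorithm: return one valid topological ordering of a DAG.
    # nodes: hashable node ids (e.g. tensor-operator names).
    # edges: (u, v) pairs meaning operator u must run before operator v.
    nodes = list(nodes)
    succ = {n: [] for n in nodes}    # adjacency list
    indeg = {n: 0 for n in nodes}    # number of unmet dependencies
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1

    ready = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)

    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")
    return order

# Toy operator graph with a skip connection: conv -> relu -> add, plus conv -> add.
print(topological_order(["conv", "relu", "add"],
                        [("conv", "relu"), ("relu", "add"), ("conv", "add")]))

Because every dependency points from an earlier to a later position in the resulting order, any split of that order into contiguous stages respects the operator dependencies, which is what makes the sequence-partitioning step described in the abstract well defined.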
dc.description.abstract [en]:
As neural network models become gigantic, they increasingly demand more time and memory for training. To meet these demands, advanced parallel computing techniques have become essential. This research focuses on hybrid parallelism, an extension of pipeline parallelism. Pipeline parallelism splits the neural network into sub-networks distributed across a sequence of processing units, enabling simultaneous processing of different data segments on each device. Hybrid parallelism extends this concept by allocating multiple devices to each sub-network.
This research optimizes hybrid parallelism by improving how the model is partitioned and how computational devices are assigned. I address these issues by modeling the neural network as a directed acyclic graph of tensor operators and then demonstrating that optimally partitioning this graph is NP-complete. I therefore propose a two-step approach: the first step determines a sequence of nodes, and the second step uses dynamic programming to partition that sequence so that the load stays balanced across the assigned devices.
In transforming the graph into a sequence, I explore two methods: one employs topological sorting, while the other clusters non-sequential subgraphs. I implement both methods and select the more effective one based on performance.
I implement my algorithm and conduct experiments. The results show substantial improvements in both partitioning speed and training throughput, with speedups of up to 23x in partitioning time and a 1.3x increase in training throughput.
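To make the second step concrete, here is a minimal Python sketch of a dynamic program that splits a cost-annotated operator sequence into contiguous, load-balanced stages. The per-operator costs, the number of devices, and the omission of communication cost and per-stage device assignment are simplifying assumptions for illustration; this is not the thesis's actual hybrid-parallel partitioning algorithm.

from itertools import accumulate

def balanced_partition(costs, k):
    # Split a sequence of per-operator costs into k contiguous stages,
    # minimizing the cost of the most loaded stage (classic linear-partition DP).
    # Returns (bottleneck_cost, stages), each stage being a (start, end) index range.
    n = len(costs)
    prefix = [0] + list(accumulate(costs))

    def seg(i, j):                        # total cost of costs[i:j]
        return prefix[j] - prefix[i]

    INF = float("inf")
    # dp[j][p]: best bottleneck when the first j operators form p stages.
    dp = [[INF] * (k + 1) for _ in range(n + 1)]
    cut = [[0] * (k + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for p in range(1, k + 1):
        for j in range(p, n + 1):
            for i in range(p - 1, j):     # stage p covers costs[i:j]
                cand = max(dp[i][p - 1], seg(i, j))
                if cand < dp[j][p]:
                    dp[j][p], cut[j][p] = cand, i

    stages, j = [], n                     # recover stage boundaries from the cut table
    for p in range(k, 0, -1):
        i = cut[j][p]
        stages.append((i, j))
        j = i
    return dp[n][k], stages[::-1]

# Toy per-operator costs (e.g. profiled milliseconds) split across 3 devices.
print(balanced_partition([4, 2, 7, 1, 3, 5, 2], 3))

The triple loop makes this sketch O(k·n²); it is meant only to state the recurrence, not to reflect the partitioning speed reported in the experiments.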
dc.description.provenance [en]: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-07-12T16:08:15Z. No. of bitstreams: 0
dc.description.provenance [en]: Made available in DSpace on 2024-07-12T16:08:15Z (GMT). No. of bitstreams: 0
dc.description.tableofcontents:
Verification Letter from the Oral Examination Committee i
摘要 (Chinese Abstract) ii
Abstract iii
Contents v
List of Figures vii
Chapter 1 Introduction 1
Chapter 2 Related Works 3
2.1 Auto-Parallelism 3
2.1.1 Parallelism Schemes 3
2.1.2 Device Topology 4
2.1.3 Performance Evaluation Methods 4
2.1.4 Strategy Searching Methods 4
2.2 Hybrid Parallelism 5
2.3 Graph Partitioning 6
Chapter 3 Background 8
Chapter 4 Graph Partition Problem 11
4.1 Problem Definition 11
4.2 NP-Completeness 11
4.3 Transforming Graph into Sequence 12
4.3.1 Subgraph Contraction 12
4.3.2 Topological Sorting 13
4.4 Hybrid Parallel Partition 13
Chapter 5 Experiment 19
5.1 Experimental Setup 19
5.1.1 Implementation Details 19
5.1.2 Hardware Configuration 19
5.1.3 Baseline 19
5.1.4 Training Settings 20
5.1.5 Profiling and Preprocessing 21
5.1.6 Models 21
5.2 Analysis of Throughput 23
5.2.1 Granularity of Partitioning 23
5.2.2 Communication Time 23
5.3 Analysis of Partition Times 25
Chapter 6 Conclusion 28
References 29
dc.language.iso: en
dc.subject: 動態規劃 [zh_TW]
dc.subject: 管線平行訓練 [zh_TW]
dc.subject: 機器學習 [zh_TW]
dc.subject: 深度神經網路 [zh_TW]
dc.subject: 圖分割 [zh_TW]
dc.subject: graph partitioning [en]
dc.subject: deep neural network [en]
dc.subject: machine learning [en]
dc.subject: dynamic programming [en]
dc.subject: pipeline parallel [en]
dc.title: 多圖形處理器環境下優化深度學習網絡的管線平行訓練 [zh_TW]
dc.title: Execution Time Optimization for Pipeline Deep Network Training on Multiple GPUs [en]
dc.type: Thesis
dc.date.schoolyear: 112-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 洪鼎詠; 吳真貞 [zh_TW]
dc.contributor.oralexamcommittee: Ding-Yong Hong; Jan-Jan Wu [en]
dc.subject.keyword: 管線平行訓練, 動態規劃, 圖分割, 深度神經網路, 機器學習 [zh_TW]
dc.subject.keyword: pipeline parallel, dynamic programming, graph partitioning, deep neural network, machine learning [en]
dc.relation.page: 32
dc.identifier.doi: 10.6342/NTU202401090
dc.rights.note: 未授權 (not authorized for public access)
dc.date.accepted: 2024-07-08
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 資訊工程學系 (Department of Computer Science and Information Engineering)
Appears in Collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in this item:
File: ntu-112-2.pdf (restricted access, not authorized for public access)
Size: 1.2 MB
Format: Adobe PDF


