Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96374

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 劉邦鋒 | zh_TW |
| dc.contributor.advisor | Pangfeng Liu | en |
| dc.contributor.author | 林采鋒 | zh_TW |
| dc.contributor.author | Cai-Feng Lin | en |
| dc.date.accessioned | 2025-02-13T16:10:55Z | - |
| dc.date.available | 2025-02-14 | - |
| dc.date.copyright | 2025-02-13 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-02-06 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96374 | - |
| dc.description.abstract | 圖神經網路是深度學習中處理結構化數據的重要工具,其中圖由節點與邊組成,用於表示實體及其關係。當圖神經網路呈樹狀結構時會面臨各種挑戰,例如不規則的結構與變化的深度。將其分配與並行執行於多張圖形處理器上極具挑戰性。此外,樹狀結構的相依性要求先處理父節點再處理子節點,這嚴重限制了執行的並行性。
本研究旨在提升樹狀的圖神經網路在多圖形處理器系統上的訓練速度。首先,我們引入一個成本模型來估算多圖形處理器訓練的執行時間。接著,我們證明在該成本模型下,尋找最佳的樹狀數據分配方式是一個NP完全問題。因此,我們提出一種啟發式方法來分配數據,以提高效率並維持訓練品質。此方法首先基於成本模型將數據分配至不同批次,然後在每個批次內將數據分配至多個圖形處理器。我們進一步證明,我們的分配演算法是一個四倍近似演算法,確保其在實務上具有效率。我們實作了該演算法並進行實驗。結果顯示,我們的演算法在訓練速度上獲得顯著提升。當使用兩張圖形處理器時加速比達 1.86,四張圖形處理器達 3.43,而八張圖形處理器達 7.25。 | zh_TW |
| dc.description.abstract | Graph Neural Network (GNN) is an important tool in deep learning for handling structured data, where graphs of nodes and edges represent entities and their relationships. Various challenges arise when a GNN is tree-shaped, with irregular connectivity patterns and varying depths. It is difficult to distribute and process such dynamic structures for parallel execution on multiple GPUs. In addition, tree data dependency demands the processing of parent nodes before their children, severely limiting execution parallelism.
This research aims to improve the training speed of tree-shaped GNNs on multi-GPU systems. First, we introduce a cost model that estimates the running time of training across multiple GPUs. Then, we demonstrate that finding an optimal way to distribute tree-structured data across GPUs is an NP-complete problem under this cost model. We then propose a practical heuristic method for distributing data that improves efficiency while maintaining training quality. The heuristic first assigns data to batches based on our cost model and then assigns the data in each batch to the devices. We also show that our device-assignment algorithm is a 4-approximation algorithm; that is, it guarantees that its cost is at most four times the optimal running time in each training batch, ensuring that it performs effectively in practice. We implement the algorithm and conduct experiments. The results show that our algorithm achieves a significant speedup in training: up to 1.86 with two GPUs, 3.43 with four GPUs, and 7.25 with eight GPUs. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-02-13T16:10:55Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-02-13T16:10:55Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | 口試委員審定書 i; Acknowledgements ii; 摘要 iv; Abstract v; Contents vii; List of Figures ix; List of Tables x; Chapter 1 Introduction 1; 1.1 Introduction 1; Chapter 2 Related Work 4; 2.1 Graph Neural Network 4; 2.2 Data Parallelism 5; 2.3 Processing Tree-Shaped Data 6; 2.4 Partition Problem 7; Chapter 3 Tree Grouping Problem 9; 3.1 Cost Model 9; 3.2 Problem Formulation 11; 3.3 NP-Completeness 11; Chapter 4 Scheme 13; 4.1 Greedy Grouping Algorithm 13; 4.1.1 Assigning Trees to Batches 13; 4.1.2 Assigning Trees to Devices in a Batch 14; 4.2 Approximation 14; 4.3 Sequential Execution with One Device 19; Chapter 5 Experiment 22; 5.1 Experiment Setup 22; 5.1.1 Implementation Details 22; 5.1.2 Hardware Configuration 22; 5.1.3 Models 23; 5.1.4 Dataset 23; 5.1.5 Baseline 23; 5.2 Cost Model Accuracy 24; 5.3 Analysis of Speedup 27; 5.4 Analysis of Model Accuracy 29; Chapter 6 Conclusion and Future Work 30; References 31 | - |
| dc.language.iso | en | - |
| dc.subject | NP完備 | zh_TW |
| dc.subject | 近似演算法 | zh_TW |
| dc.subject | 分區問題 | zh_TW |
| dc.subject | 圖神經網路 | zh_TW |
| dc.subject | 深度學習 | zh_TW |
| dc.subject | Approximation Algorithm | en |
| dc.subject | Deep Learning | en |
| dc.subject | Graph Neural Network | en |
| dc.subject | Partition Problem | en |
| dc.subject | NP-Completeness | en |
| dc.title | 多圖形處理器環境下訓練樹狀模型的高效能資料分群演算法 | zh_TW |
| dc.title | A Grouping Algorithm for Training Tree-Shaped Models on Multiple GPUs with High Efficiency | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-1 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.oralexamcommittee | 吳真貞;洪鼎詠 | zh_TW |
| dc.contributor.oralexamcommittee | Jan-Jan Wu;Ding-Yong Hong | en |
| dc.subject.keyword | 深度學習,圖神經網路,分區問題,NP完備,近似演算法, | zh_TW |
| dc.subject.keyword | Deep Learning,Graph Neural Network,Partition Problem,NP-Completeness,Approximation Algorithm, | en |
| dc.relation.page | 33 | - |
| dc.identifier.doi | 10.6342/NTU202500427 | - |
| dc.rights.note | 同意授權(全球公開) | - |
| dc.date.accepted | 2025-02-06 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 資訊工程學系 | - |
| dc.date.embargo-lift | 2025-02-14 | - |
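The abstract above describes a two-stage heuristic: trees are first grouped into batches according to a cost model, and the trees in each batch are then distributed across GPUs so that the per-batch running time is balanced, with the device-assignment step shown to be a 4-approximation. The thesis's actual cost model and algorithm are not reproduced in this record, so the following is only a minimal illustrative sketch under assumed details: `estimate_cost` is a hypothetical node-count cost proxy, and the device assignment uses a standard longest-processing-time-first (LPT) greedy, the classic balancing strategy behind constant-factor guarantees of this kind.

```python
import heapq

def estimate_cost(tree_sizes):
    """Hypothetical cost proxy: assume a tree's running time grows with its node count."""
    return list(tree_sizes)

def assign_to_batches(costs, batch_size):
    """Group tree indices into fixed-size batches, largest estimated cost first."""
    order = sorted(range(len(costs)), key=lambda i: costs[i], reverse=True)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

def assign_batch_to_devices(batch, costs, num_gpus):
    """LPT-style greedy: place each tree, largest first, on the currently least-loaded GPU."""
    loads = [(0, gpu) for gpu in range(num_gpus)]  # (accumulated cost, gpu id)
    heapq.heapify(loads)
    assignment = {gpu: [] for gpu in range(num_gpus)}
    for tree in sorted(batch, key=lambda i: costs[i], reverse=True):
        load, gpu = heapq.heappop(loads)        # least-loaded GPU so far
        assignment[gpu].append(tree)
        heapq.heappush(loads, (load + costs[tree], gpu))
    return assignment

if __name__ == "__main__":
    sizes = [120, 35, 87, 64, 250, 18, 99, 143]   # made-up node counts per tree
    costs = estimate_cost(sizes)
    for b, batch in enumerate(assign_to_batches(costs, batch_size=4)):
        plan = assign_batch_to_devices(batch, costs, num_gpus=2)
        summary = ", ".join(f"GPU{g}: trees {trees}" for g, trees in sorted(plan.items()))
        print(f"batch {b} -> {summary}")
```

In the thesis itself, the measured cost model of Chapter 3 would take the place of the node-count proxy, and the 4-approximation device-assignment algorithm of Chapter 4 would take the place of this plain LPT greedy; the sketch is meant only to illustrate the batch-then-device structure of the heuristic.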
Appears in Collections: 資訊工程學系
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-113-1.pdf | 694.87 kB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.