Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96374

| Title: | A Grouping Algorithm for Training Tree-Shaped Models on Multiple GPUs with High Efficiency |
| Author: | 林采鋒 Cai-Feng Lin |
| Advisor: | 劉邦鋒 Pangfeng Liu |
| Keywords: | Deep Learning, Graph Neural Network, Partition Problem, NP-Completeness, Approximation Algorithm |
| Publication Year: | 2025 |
| Degree: | Master |
| Abstract: | Graph Neural Networks (GNNs) are an important tool in deep learning for handling structured data, where graphs of nodes and edges represent entities and their relationships. Various challenges arise when a GNN is tree-shaped, with irregular connectivity patterns and varying depth, and distributing such a dynamic structure for parallel execution on multiple GPUs is difficult. In addition, tree data dependency demands that parent nodes be processed before their children, which severely limits execution parallelism. This research aims to improve the training speed of tree-shaped GNNs on multi-GPU systems. First, we introduce a cost model that estimates the running time of training across multiple GPUs. Then, we show that finding an optimal distribution of tree-structured data across GPUs under this cost model is an NP-complete problem. We therefore propose a practical heuristic for distributing the data that improves efficiency while maintaining training quality. The heuristic first assigns data to batches based on our cost model and then assigns the data within each batch to the devices. We also show that our device-assigning algorithm is a 4-approximation algorithm; that is, its cost is at most four times the optimal running time in each training batch, which ensures that it performs effectively in practice (an illustrative sketch of a greedy assignment of this kind appears after the metadata table below). We implemented the algorithm and conducted experiments. The results show that our algorithm achieves a significant reduction in training time: the speedup reaches 1.86 with two GPUs, 3.43 with four GPUs, and 7.25 with eight GPUs. |
| URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96374 |
| DOI: | 10.6342/NTU202500427 |
| Full-Text Authorization: | Granted (open access worldwide) |
| Electronic Full-Text Release Date: | 2025-02-14 |
| Appears in Collections: | Department of Computer Science and Information Engineering |
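
The device-assigning step described in the abstract resembles classic greedy makespan scheduling: each tree, taken in decreasing order of estimated cost, goes to the currently least-loaded GPU. Below is a minimal Python sketch of that style of assignment, assuming each tree has a single scalar cost produced by some cost model; the function name `assign_trees_to_gpus`, the cost values, and the GPU count are illustrative assumptions, not the thesis's actual implementation or its 4-approximation construction.

```python
# Illustrative sketch only: greedy "least-loaded device" assignment in the
# spirit of the device-assigning step described in the abstract. The costs
# and GPU count below are hypothetical placeholders.
import heapq
from typing import Sequence


def assign_trees_to_gpus(tree_costs: Sequence[float], num_gpus: int) -> list[list[int]]:
    """Assign each tree (by index) to a GPU, greedily balancing total cost.

    Trees are considered in decreasing cost order; each tree goes to the
    GPU with the smallest accumulated cost so far (classic list scheduling).
    """
    # Min-heap of (accumulated_cost, gpu_id) pairs, one per GPU.
    heap = [(0.0, g) for g in range(num_gpus)]
    heapq.heapify(heap)
    assignment: list[list[int]] = [[] for _ in range(num_gpus)]
    # Consider tree indices from most to least expensive.
    order = sorted(range(len(tree_costs)), key=lambda i: tree_costs[i], reverse=True)
    for i in order:
        load, gpu = heapq.heappop(heap)          # least-loaded GPU
        assignment[gpu].append(i)
        heapq.heappush(heap, (load + tree_costs[i], gpu))
    return assignment


if __name__ == "__main__":
    # Hypothetical per-tree costs from a cost model estimate.
    costs = [9.0, 7.5, 6.0, 4.0, 3.5, 2.0, 1.0]
    print(assign_trees_to_gpus(costs, num_gpus=2))
```

Greedy list scheduling of this form is a well-known constant-factor approximation for makespan minimization; the thesis proves a 4-approximation bound under its own cost model, which this sketch does not attempt to reproduce.
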
Files in this item:
| File | Size | Format | |
|---|---|---|---|
| ntu-113-1.pdf | 694.87 kB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
