Title: | Efficient and Robust Pipeline Design for Multi-GPU DNN Training through Model Parallelism (多繪圖處理器平台下運用模型平行化以實現高效並可靠之深度學習訓練) |
Authors: | Chi-Chung Chen (陳啟中) |
Advisor: | Chia-Lin Yang (楊佳玲) |
Keywords: | Deep Learning, Parallelism, Multi-GPU Platform, Pipeline Processing, Error Compensation |
Publication Year: | 2018 |
Degree: | Master's |
Abstract: | The training process of a Deep Neural Network (DNN) is compute-intensive, often taking days to weeks to complete. Parallel execution of DNN training on multiple GPUs is therefore a widely adopted approach to speed up the process. Owing to its implementation simplicity, data parallelism is currently the most commonly used parallelization method; however, it suffers from excessive inter-GPU communication overhead caused by frequent weight synchronization among GPUs. An alternative is model parallelism, which partitions the model among GPUs. This approach significantly reduces inter-GPU communication cost compared to data parallelism, but maintaining load balance becomes a challenge. Moreover, model parallelism faces the staleness issue: gradients are computed with stale weights. In this thesis, we propose a novel model parallelism method that achieves load balance by concurrently executing the forward and backward passes of two batches, and resolves the staleness issue with weight prediction. Experimental results show that our proposal achieves up to 15.77x speedup over data parallelism and up to 2.18x speedup over the state-of-the-art model parallelism method, without incurring accuracy loss. |
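To illustrate the weight prediction idea mentioned in the abstract, the following is a minimal sketch, assuming a momentum-SGD optimizer and a fixed pipeline staleness of a few steps. The extrapolation formula, the parameter names (`LR`, `MOMENTUM`, `predict_weights`), and the toy loop are illustrative assumptions, not the thesis's actual implementation.

```python
# Minimal sketch (not the thesis implementation): weight prediction for
# pipelined model parallelism. With `staleness` batches in flight, a gradient
# is applied several steps after it was computed, so the forward/backward
# passes run on weights extrapolated ahead using the momentum-smoothed gradient.
import numpy as np

LR = 0.01        # learning rate (illustrative)
MOMENTUM = 0.9   # momentum coefficient (illustrative)

def predict_weights(weights, velocity, staleness):
    """Extrapolate weights `staleness` steps ahead with the smoothed gradient."""
    return [w - staleness * LR * v for w, v in zip(weights, velocity)]

def momentum_sgd_step(weights, velocity, grads):
    """Standard momentum-SGD update applied once the delayed gradient arrives."""
    new_velocity = [MOMENTUM * v + g for v, g in zip(velocity, grads)]
    new_weights = [w - LR * v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity

# Toy usage: one pipeline stage whose gradients are delayed by `staleness` steps.
staleness = 3
weights = [np.zeros(4)]
velocity = [np.zeros(4)]
for step in range(10):
    lookahead = predict_weights(weights, velocity, staleness)  # run fwd/bwd on these
    grads = [np.random.randn(4) * 0.1]                         # stand-in for real gradients
    weights, velocity = momentum_sgd_step(weights, velocity, grads)
```

In this sketch, predicting with the smoothed gradient aims to make the weights used for the forward and backward passes closer to the weights that will exist when the gradient is finally applied, which is the role weight prediction plays against staleness in the abstract.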
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/72344 |
DOI: | 10.6342/NTU201802788 |
Fulltext Rights: | Paid authorization required |
Appears in Collections: | Department of Computer Science and Information Engineering |
Files in This Item:
File | Size | Format
---|---|---
ntu-107-1.pdf (Restricted Access) | 961.55 kB | Adobe PDF