Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/72344
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 楊佳玲(Chia-Lin Yang) | |
dc.contributor.author | Chi-Chung Chen | en |
dc.contributor.author | 陳啟中 | zh_TW |
dc.date.accessioned | 2021-06-17T06:36:38Z | - |
dc.date.available | 2020-08-18 | |
dc.date.copyright | 2018-08-18 | |
dc.date.issued | 2018 | |
dc.date.submitted | 2018-08-16 | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/72344 | - |
dc.description.abstract | 深度類神經網路的訓練需要大量運算,經常花費數天至數個禮拜才能完成。因此,運用多繪圖處理器平行計算以加速訓練深度類神經網路是現今常用的方法。其中資料平行化 (data parallelism) 由於其易於實作,是目前主流的作法。然而,使用資料平行化經常導致大量的繪圖處理器間資料傳輸 (inter-GPU communication) 而影響效能。另一種平行化的方式為模型平行化 (model parallelism) ,作法是讓各繪圖處理器分別負責一部分的類神經網路模型,此方法大幅降低了繪圖處理器間資料傳輸,但衍生出負載平衡 (load balance) 及權重老舊 (staleness issue) 的問題需要解決。本論文中,我們提出了一個創新的模型平行化方法,利用同步執行前向計算 (forward pass) 和後向計算 (backward pass) 以達到負載平衡,及提出權重預測 (weight prediction) 的機制以緩解權重老舊 (staleness issue) 的問題。
實驗結果顯示,我們的方法可以得到相比於資料平行化多達 15.77 倍的加速,及與目前最新的模型平行化演算法相比取得多達 2.18 倍的加速,且不影響訓練的準確率。 | zh_TW |
dc.description.abstract | Training a Deep Neural Network (DNN) is compute-intensive, often taking days to weeks to complete. Parallelizing DNN training across multiple GPUs is therefore a widely adopted approach to speed up the process. Owing to its implementation simplicity, data parallelism is currently the most commonly used parallelization method; nonetheless, it suffers from excessive inter-GPU communication overhead caused by frequent weight synchronization among GPUs. The alternative, model parallelism, partitions the model among GPUs and thus significantly reduces inter-GPU communication cost, but it makes load balancing a challenge. Moreover, model parallelism faces the staleness issue: gradients are computed with stale weights. In this thesis, we propose a novel model parallelism method that achieves load balance by concurrently executing the forward and backward passes of two batches, and resolves the staleness issue with weight prediction.
The experimental results show that our approach achieves up to a 15.77x speedup over data parallelism and up to a 2.18x speedup over the state-of-the-art model parallelism method, without incurring accuracy loss. (An illustrative sketch of weight prediction appears after this metadata table.) | en |
dc.description.provenance | Made available in DSpace on 2021-06-17T06:36:38Z (GMT). No. of bitstreams: 1 ntu-107-R05922063-1.pdf: 984623 bytes, checksum: 5886a8b52b1a086003781b285c18d2fd (MD5) Previous issue date: 2018 | en |
dc.description.tableofcontents | Certification by the Oral Examination Committee i
Acknowledgements ii
Abstract (in Chinese) iii
Abstract iv
1 Introduction 1
2 Background and Motivation 3
2.1 DNN Training 3
2.2 Parallel Training 3
2.3 Data Parallelism vs. Model Parallelism 5
3 DualPipe: Load Balanced and Robust Pipeline Design 9
3.1 Load Balanced Pipeline 9
3.2 Staleness Issue 11
3.3 SpecTrain: Staleness Mitigation via Weight Prediction 13
3.3.1 Weight Prediction 14
3.3.2 Prediction Accuracy 15
3.4 DualPipe Implementation 17
3.5 Summary 19
4 Experiments 20
4.1 Experiment Setup 20
4.2 Throughput 22
4.3 Performance Breakdown 23
4.4 Staleness and Convergence 26
4.5 Time to Convergence 28
5 Related Works 30
6 Conclusion 32
Bibliography 33 | |
dc.language.iso | en | |
dc.title | 多繪圖處理器平台下運用模型平行化以實現高效並可靠之深度學習訓練 | zh_TW |
dc.title | Efficient and Robust Pipeline Design for Multi-GPU DNN Training through Model Parallelism | en |
dc.type | Thesis | |
dc.date.schoolyear | 106-2 | |
dc.description.degree | Master's | |
dc.contributor.oralexamcommittee | 徐慰中 (Wei-Chung Hsu), 葉彌妍 (Mi-Yen Yeh), 鄭湘筠 (Hsiang-Yun Cheng) | |
dc.subject.keyword | 深度學習, 平行運算, 多繪圖處理器平台, 流水線計算, 誤差補償 | zh_TW |
dc.subject.keyword | Deep Learning, Parallelism, Multi-GPU Platform, Pipeline Processing, Error Compensation | en |
dc.relation.page | 39 | |
dc.identifier.doi | 10.6342/NTU201802788 | |
dc.rights.note | Paid authorization (有償授權) | |
dc.date.accepted | 2018-08-16 | |
dc.contributor.author-college | 電機資訊學院 | zh_TW |
dc.contributor.author-dept | 資訊工程學研究所 | zh_TW |
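To make the weight-prediction idea from the abstract concrete, below is a minimal sketch in PyTorch, assuming momentum SGD: a pipeline stage whose forward pass will only be consumed several weight updates in the future extrapolates the current weights along the smoothed-gradient (momentum) direction. The function name, the linear extrapolation `w_hat = w - steps_ahead * lr * v`, and the usage are illustrative assumptions, not the thesis's actual SpecTrain implementation.

```python
import torch

def predict_future_weights(params, momentum_bufs, lr, steps_ahead):
    """Sketch of weight prediction: extrapolate each weight tensor
    `steps_ahead` updates into the future along its momentum
    (smoothed-gradient) direction, i.e. w_hat = w - steps_ahead * lr * v."""
    return [p - steps_ahead * lr * v for p, v in zip(params, momentum_bufs)]

# Illustrative usage: a stage runs its forward pass on weights predicted
# 3 updates ahead, so the matching backward pass later sees
# (approximately) the same weight version and staleness is mitigated.
weights = [torch.randn(4, 4)]              # current weight version
momentum = [torch.zeros_like(weights[0])]  # momentum buffers
predicted = predict_future_weights(weights, momentum, lr=0.01, steps_ahead=3)
```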
Appears in Collections: | Department of Computer Science and Information Engineering
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-107-1.pdf (access currently restricted) | 961.55 kB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.