Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/22117
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 劉邦鋒(Pangfeng Liu) | |
dc.contributor.author | Tai-An Chen | en |
dc.contributor.author | 陳泰安 | zh_TW |
dc.date.accessioned | 2021-06-08T04:03:26Z | - |
dc.date.copyright | 2018-08-09 | |
dc.date.issued | 2018 | |
dc.date.submitted | 2018-08-02 | |
dc.identifier.citation | [1] J. Chen, R. Monga, S. Bengio, and R. Jozefowicz. Revisiting distributed synchronous SGD. In International Conference on Learning Representations Workshop Track, 2016.
[2] D. C. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. CoRR, abs/1202.2745, 2012.
[3] Convnet. https://code.google.com/archive/p/cuda-convnet/. (Accessed: 2017-11-26).
[4] cuDNN. https://devblogs.nvidia.com/deep-learning-computer-vision-caffe-cudnn/. (Accessed: 2018-06-10).
[5] H. Cui, H. Zhang, G. R. Ganger, P. B. Gibbons, and E. P. Xing. GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server. In Proceedings of the Eleventh European Conference on Computer Systems, EuroSys '16, pages 4:1–4:16, 2016.
[6] D. Das, S. Avancha, D. Mudigere, K. Vaidyanathan, S. Sridharan, D. D. Kalamkar, B. Kaul, and P. Dubey. Distributed deep learning using synchronous stochastic gradient descent. 2016.
[7] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng. Large scale distributed deep networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS'12, pages 1223–1231, 2012.
[8] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, July 2011.
[9] A. M. Elkahky, Y. Song, and X. He. A multi-view deep learning approach for cross domain user modeling in recommendation systems. In Proceedings of the 24th International Conference on World Wide Web, WWW '15, pages 278–288, 2015.
[10] P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. CoRR, abs/1706.02677, 2017.
[11] A. Y. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y. Ng. Deep Speech: Scaling up end-to-end speech recognition. CoRR, abs/1412.5567, 2014.
[12] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[13] P. H. Jin, Q. Yuan, F. N. Iandola, and K. Keutzer. How to scale distributed deep learning? 2016.
[14] A. Krizhevsky. Learning multiple layers of features from tiny images. https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf. (Accessed: 2017-11-26).
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105, 2012.
[16] L.-Y. Ho, J.-J. Wu, and P. Liu. Adaptive communication for distributed deep learning on commodity GPU cluster. May 2018.
[17] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. May 2017.
[18] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[19] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis. Mastering the game of Go without human knowledge. Nature, 550:354–359, October 2017.
[20] C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, Inception-ResNet and the impact of residual connections on learning. CoRR, abs/1602.07261, 2016.
[21] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, June 2015.
[22] S. Zhang, A. Choromanska, and Y. LeCun. Deep learning with elastic averaging SGD. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 685–693, 2015. | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/22117 | - |
dc.description.abstract | 分散式深度學習在發展人工智慧系統中扮演著重要的角色。如今,有研究提供了許多分散式學習演算法來加快訓練的過程,在這些演算法中,每台機器為了加速收斂必須經常交換梯度。但是,固定週期的梯度交換可能導致資料傳輸效率較差。在這篇論文中,我們提出了一個有效率的溝通方法,來提高隨機梯度下降演算法的效能。我們根據模型的變化來決定溝通的時機,當模型有巨量的變化的時候,我們將模型傳送給其他的機器來計算新的平均結果。除此之外,我們動態地設置一個閾值來控制溝通週期。有了這種有效的溝通方法,我們可以減少溝通的傳輸量,從而提高效能。 | zh_TW |
dc.description.abstract | Distributed deep learning plays an important role in developing artificial intelligence systems. Many distributed learning algorithms have been proposed to speed up the training process; in these algorithms, the workers must exchange gradients frequently for fast convergence. However, exchanging gradients at a fixed period can lead to inefficient data transmission. In this paper, we propose an efficient communication method that improves the performance of the gossiping stochastic gradient descent algorithm. We decide the timing of communication according to how much the local model has changed: when the local model changes significantly, we push it to other workers to compute a new averaged result. In addition, we dynamically set a threshold that controls the communication period. With this efficient communication method, we reduce the communication overhead and thus improve performance. (An illustrative sketch of this decision rule follows the metadata table below.) | en |
dc.description.provenance | Made available in DSpace on 2021-06-08T04:03:26Z (GMT). No. of bitstreams: 1 ntu-107-R02922045-1.pdf: 531868 bytes, checksum: 898c3de8679432de513b3b924ad224d5 (MD5) Previous issue date: 2018 | en |
dc.description.tableofcontents | 誌謝 ii
摘要 iii
Abstract iv
1 Introduction 1
2 Related works 5
2.1 Synchronous SGD 5
2.2 Asynchronous SGD 6
2.2.1 Centralized architecture 6
2.2.2 Decentralized architecture 7
3 Preliminary 9
3.1 Caffe 9
3.2 Gossiping stochastic gradient descent 10
4 Architecture 12
5 Efficient Communication 15
5.1 The delta of a model 15
5.2 Difference of model delta 16
5.3 Decision of threshold 16
5.4 Communication mode 17
6 Experiments 18
7 Conclusion 21
Bibliography 22 | |
dc.language.iso | en | |
dc.title | 在分散式系統中對於隨機梯度下降演算法的適應性溝通模式 | zh_TW |
dc.title | Adaptive Communication for Stochastic Gradient Descent in Distributed Deep Learning Systems | en |
dc.type | Thesis | |
dc.date.schoolyear | 106-2 | |
dc.description.degree | 碩士 | |
dc.contributor.oralexamcommittee | 吳貞真(Jan-Jan Wu),洪鼎詠(Ding-Yong Hong) | |
dc.subject.keyword | 深度學習,分散式學習,隨機梯度下降演算法 | zh_TW |
dc.subject.keyword | Deep Learning, Distributed Learning, Stochastic Gradient Descent | en |
dc.relation.page | 24 | |
dc.identifier.doi | 10.6342/NTU201801511 | |
dc.rights.note | 未授權 | |
dc.date.accepted | 2018-08-02 | |
dc.contributor.author-college | 電機資訊學院 | zh_TW |
dc.contributor.author-dept | 資訊工程學研究所 | zh_TW |
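The abstract describes deciding when to communicate by thresholding the change (delta) of the local model since the last exchange, and adapting that threshold dynamically. The following is a minimal, hypothetical Python sketch written only to make that decision rule concrete; the class, method names, and the particular threshold-update rule are assumptions for illustration, not the thesis implementation.

```python
# Illustrative sketch (not the thesis code) of adaptive communication for
# gossiping SGD: each worker trains locally and gossips its model to a peer
# only when the model has changed enough since the last exchange.
import random

import numpy as np


class AdaptiveGossipWorker:
    def __init__(self, model, peers, init_threshold=0.01, decay=0.9):
        self.model = model                 # flat numpy array of parameters
        self.peers = peers                 # other workers to gossip with
        self.threshold = init_threshold    # dynamically adjusted threshold
        self.decay = decay                 # smoothing factor for the threshold
        self.snapshot = model.copy()       # parameters at the last exchange

    def local_step(self, gradient, lr=0.01):
        """One local SGD update; no communication happens here."""
        self.model -= lr * gradient

    def maybe_communicate(self):
        """Gossip only when the accumulated model delta is large enough."""
        delta = float(np.linalg.norm(self.model - self.snapshot))
        if delta < self.threshold:
            return False                   # keep training locally
        # Push the local model to a randomly chosen peer and average.
        peer = random.choice(self.peers)
        averaged = (peer.model + self.model) / 2.0
        peer.model = averaged.copy()
        self.model = averaged.copy()
        # Adapt the threshold toward the observed delta so the communication
        # period tracks how fast the model is still changing (one plausible
        # rule; the thesis may use a different update).
        self.threshold = self.decay * self.threshold + (1 - self.decay) * delta
        self.snapshot = self.model.copy()
        return True
```

In this sketch a worker would call `local_step` on every iteration and `maybe_communicate` afterwards, so communication happens only when the model has drifted past the adaptive threshold rather than at a fixed period.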
Appears in Collections: | 資訊工程學系 (Computer Science and Information Engineering)
Files in This Item:
File | Size | Format |
---|---|---|
ntu-107-1.pdf (Restricted Access; not currently available to the public) | 519.4 kB | Adobe PDF |
Items in the system are protected by copyright, with all rights reserved, unless otherwise indicated.