Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/1196
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 劉邦鋒(Pangfeng Liu) | |
dc.contributor.author | Po-Yen Wu | en |
dc.contributor.author | 吳伯彥 | zh_TW |
dc.date.accessioned | 2021-05-12T09:34:04Z | - |
dc.date.available | 2020-07-19 | |
dc.date.available | 2021-05-12T09:34:04Z | - |
dc.date.copyright | 2018-07-19 | |
dc.date.issued | 2018 | |
dc.date.submitted | 2018-07-09 | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/handle/123456789/1196 | - |
dc.description.abstract | 深度學習已經成為最有希望解決人工智慧問題的方法之一。有效率地訓練一個大規模深度學習模型非常具有挑戰性,一個廣泛使用的加速方法是利用集中式的參數伺服器將計算分散到多臺工作節點上。為了克服因工作節點與參數伺服器交換資料而造成的通訊成本,通常會採用三種最佳化方法:資料放置、一致性控制和壓縮。
在本文中,我們提出了模組化參數伺服器架構,其具有多個容易覆蓋的關鍵元件。這讓開發者可以輕鬆地將最佳化技術整合至訓練過程中,而不必在現有系統中使用特殊的方式實作。通過這個平臺,使用者能分析不同技術組合,並開發新的最佳化演算法。實驗結果顯示,和 Google 的分散式 Tensorflow 相比,藉由結合多種最佳化技巧,基於模組化參數伺服器的分散式訓練系統在運算上能夠達到接近線性的加速,並在減少一半訓練時間的同時保持收斂的準確度。 | zh_TW |
dc.description.abstract | Deep learning has become one of the most promising approaches to solving artificial intelligence problems. Training large-scale deep learning models efficiently is challenging. A widely used way to accelerate training is to distribute the computation across multiple worker nodes coordinated by a centralized parameter server. To overcome the communication overhead caused by exchanging information between the workers and the parameter server, three types of optimization methods are commonly adopted: data placement, consistency control, and compression.
In this thesis, we propose the modularized parameter server, an architecture built from key components that can be overridden with little effort. This allows developers to easily incorporate optimization techniques into the training process instead of resorting to ad-hoc implementations in existing systems. With this platform, users can analyze different combinations of techniques and develop new optimization algorithms. Experimental results show that, compared with Google's distributed TensorFlow, our distributed training system based on the proposed modularized parameter server achieves near-linear speedup in computation and, by combining multiple optimization techniques, halves the training time while maintaining the converged accuracy. | en |
dc.description.provenance | Made available in DSpace on 2021-05-12T09:34:04Z (GMT). No. of bitstreams: 1 ntu-107-R00922008-1.pdf: 1449793 bytes, checksum: 467a0fedaf6bc722f5a4a95c9aeb022d (MD5) Previous issue date: 2018 | en |
dc.description.tableofcontents | Acknowledgement ii
Chinese Abstract iii
Abstract iv
1 Introduction 1
2 Background 5
2.1 Deep Learning 5
2.2 Distributed Training 6
2.3 Communication Optimization 7
2.3.1 Placement 8
2.3.2 Consistency Control 8
2.3.3 Compression 9
3 Related Work 11
4 Architecture 12
4.1 Modularized Parameter Server 12
4.2 Use Case 14
4.3 Distributed Training System 17
5 Evaluation 19
6 Conclusion 26 | |
dc.language.iso | en | |
dc.title | 利用參數伺服器在深度學習中應用多樣化的通訊最佳化 | zh_TW |
dc.title | Versatile Communication Optimization for Deep Learning by Modularized Parameter Server | en |
dc.type | Thesis | |
dc.date.schoolyear | 106-2 | |
dc.description.degree | Master | |
dc.contributor.oralexamcommittee | 吳真貞(Jan-Jan Wu),徐慰中(Wei-Chung Hsu) | |
dc.subject.keyword | 深度學習,分散式訓練,參數伺服器,模組化架構,通訊最佳化 | zh_TW |
dc.subject.keyword | deep learning, distributed training, parameter server, modular architecture, communication optimization | en |
dc.relation.page | 31 | |
dc.identifier.doi | 10.6342/NTU201801371 | |
dc.rights.note | Authorization granted (open access worldwide) | |
dc.date.accepted | 2018-07-10 | |
dc.contributor.author-college | 電機資訊學院 | zh_TW |
dc.contributor.author-dept | 資訊工程學研究所 | zh_TW |
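The abstract above explains that the modularized parameter server exposes key components, such as the gradient compression step, that developers can override instead of patching an existing system in an ad-hoc way. The thesis text itself is not reproduced in this record, so the following is only a minimal, hypothetical Python sketch of that idea, not the thesis's actual API: all names (`ParameterServer`, `Compressor`, `TopKCompressor`, `push`, `pull`) and the top-k sparsifier used as the example optimization are assumptions for illustration.

```python
# Hypothetical sketch: a parameter server with an overridable compression
# component, in the spirit of the modularized design described in the abstract.
import numpy as np


class Compressor:
    """Default component: send gradients unmodified."""

    def compress(self, grad):
        return grad

    def decompress(self, payload):
        return payload


class TopKCompressor(Compressor):
    """Override: keep only the k largest-magnitude gradient entries."""

    def __init__(self, k):
        self.k = k

    def compress(self, grad):
        flat = grad.ravel()
        idx = np.argpartition(np.abs(flat), -self.k)[-self.k:]
        return grad.shape, idx, flat[idx]

    def decompress(self, payload):
        shape, idx, values = payload
        flat = np.zeros(int(np.prod(shape)))
        flat[idx] = values
        return flat.reshape(shape)


class ParameterServer:
    """Holds the model; workers push (compressed) gradients and pull weights."""

    def __init__(self, init_weights, lr=0.1, compressor=None):
        self.weights = np.array(init_weights, dtype=float)
        self.lr = lr
        self.compressor = compressor or Compressor()

    def push(self, payload):
        grad = self.compressor.decompress(payload)
        self.weights -= self.lr * grad  # plain SGD update

    def pull(self):
        return self.weights.copy()


if __name__ == "__main__":
    # Swapping in a different Compressor subclass is the only change needed
    # to experiment with another communication optimization.
    ps = ParameterServer(np.zeros(8), compressor=TopKCompressor(k=2))
    grad = np.array([0.1, -2.0, 0.0, 3.0, 0.2, 0.0, -0.1, 0.05])
    ps.push(ps.compressor.compress(grad))
    print(ps.pull())
```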
Appears in Collections: | 資訊工程學系
Files in this item:
File | Size | Format | |
---|---|---|---|
ntu-107-1.pdf | 1.42 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.