Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/21973
Full metadata record
DC field: value (language)
dc.contributor.advisor: 劉邦鋒 (Pangfeng Liu)
dc.contributor.author: Sheng-Ping Wang (en)
dc.contributor.author: 王盛平 (zh_TW)
dc.date.accessioned: 2021-06-08T03:55:49Z
dc.date.copyright: 2018-08-16
dc.date.issued: 2018
dc.date.submitted: 2018-08-15
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/21973
dc.description.abstract: Network communication is the bottleneck when distributed deep learning scales up the number of workers; one remedy is to compress the exchanged gradients into a sparse form. We observe that, during the exchange, the amount of data the server sends back to the workers shrinks as the gradients sent by the workers become more similar. Preliminary experiments show that only a small fraction of the parameters produce large gradients repeatedly within a short period of time. Based on this observation, we propose several algorithms for workers to select which gradients to send, and experiments confirm that our approach reduces the data sent by the server, shortens the time per training iteration, and lets the model converge faster than traditional compression. (zh_TW, translated)
dc.description.abstract: Communication is a bottleneck when scaling the number of workers in distributed deep learning. One solution is to compress the exchanged gradients into a sparse format through gradient sparsification. We find that the server's send cost, i.e., the aggregated size of the sparse gradients, can be reduced by the way workers select which gradients to send. Following the observation that only a few gradient elements are significantly large within a short period of time, we propose several gradient selection algorithms based on different metrics. Experiments show that the proposed methods reduce the aggregated size sent by the server, and the resulting reduction in time per iteration yields faster convergence than traditional sparsification. (en) [An illustrative sketch of gradient sparsification follows the metadata record below.]
dc.description.provenance: Made available in DSpace on 2021-06-08T03:55:49Z (GMT). No. of bitstreams: 1; ntu-107-R02922056-1.pdf: 720668 bytes, checksum: 8980029ce6e9d153c1790116bbda78a1 (MD5); Previous issue date: 2018 (en)
dc.description.tableofcontents:
口試委員會審定書 (Oral Examination Committee Certification) i
致謝 (Acknowledgements) ii
中文摘要 (Chinese Abstract) iii
Abstract iv
Contents v
List of Figures vii
List of Tables viii
1 Introduction 1
2 Related Work 4
2.1 Gradient Compression 4
2.1.1 Gradient Quantization 4
2.1.2 Gradient Sparsification 5
2.2 Relax Consistency Control 6
3 Gradient Sparsification 7
4 Communication Model 9
4.1 Decentralized Model 9
4.2 Centralized Model 10
4.3 Two Models with the Same Cost 11
5 Proposed Method 13
5.1 Problem Definition 13
5.2 Preliminary Observation 13
5.3 Priority Approach 14
5.4 Penalty Attempt 15
6 Experiment 16
6.1 Experiment Settings 16
6.2 Observation of Large Gradient Elements 16
6.3 Method Evaluation 17
6.3.1 Frequency Approach 17
6.3.2 Exponential Moving Average 18
6.3.3 Penalty Approach 18
6.3.4 Size Limitation on Server 22
6.4 Time Evaluation 22
7 Conclusion 28
Bibliography 29
dc.language.iso: en
dc.title: 對鬆散化梯度的資料聚集進行通訊量優化 (zh_TW)
dc.title: Communication Usage Optimization of Gradient Sparsification with Aggregation (en)
dc.type: Thesis
dc.date.schoolyear: 106-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 吳真貞 (Jan-Jan Wu), 洪鼎詠 (Ding-Yong Hong)
dc.subject.keyword: 平行處理, 分散式系統, 深度學習, 鬆散化梯度 (zh_TW)
dc.subject.keyword: Parallel Processing, Distributed Systems, Deep Learning, Gradient Sparsification (en)
dc.relation.page: 33
dc.identifier.doi: 10.6342/NTU201801738
dc.rights.note: 未授權 (not authorized for public release)
dc.date.accepted: 2018-08-15
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) (zh_TW)
dc.contributor.author-dept: 資訊工程學研究所 (Graduate Institute of Computer Science and Information Engineering) (zh_TW)
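
The abstracts above describe worker-side gradient selection for sparsification and server-side aggregation of the selected sparse gradients. The Python sketch below is only a minimal illustration of that general setting, assuming a NumPy environment and plain top-k selection with local error accumulation (a common technique in the sparsification literature); the names sparsify_topk and aggregate_sparse are hypothetical, and this selection rule is not the thesis's proposed priority or penalty method.

import numpy as np

def sparsify_topk(gradient, residual, k):
    """Keep the k largest-magnitude entries of (gradient + residual).

    Returns the selected indices and values, plus the updated residual
    that accumulates everything not sent this iteration.
    """
    accumulated = gradient + residual
    # Indices of the k entries with the largest absolute value (unordered).
    idx = np.argpartition(np.abs(accumulated), -k)[-k:]
    values = accumulated[idx]
    new_residual = accumulated.copy()
    new_residual[idx] = 0.0  # entries that were sent leave the residual
    return idx, values, new_residual

def aggregate_sparse(worker_updates, dim):
    """Server side: sum sparse (indices, values) updates from all workers.

    The server's send cost grows with the size of the union of the indices
    chosen by the workers, so overlapping selections are cheaper to broadcast.
    """
    dense_sum = np.zeros(dim)
    for idx, values in worker_updates:
        np.add.at(dense_sum, idx, values)
    nonzero = np.nonzero(dense_sum)[0]
    return nonzero, dense_sum[nonzero]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, k, num_workers = 1000, 10, 4
    residuals = [np.zeros(dim) for _ in range(num_workers)]

    updates = []
    for w in range(num_workers):
        grad = rng.normal(size=dim)  # stand-in for a locally computed gradient
        idx, vals, residuals[w] = sparsify_topk(grad, residuals[w], k)
        updates.append((idx, vals))

    agg_idx, agg_vals = aggregate_sparse(updates, dim)
    print(f"aggregated nonzeros: {len(agg_idx)} (upper bound {num_workers * k})")

With independent random gradients the workers' selections barely overlap, so the aggregated size stays close to the upper bound; as the abstract notes, the server's send cost drops as the workers' selections become more similar, which is what the thesis's selection algorithms aim for.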
Appears in Collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in This Item:
File: ntu-107-1.pdf | Size: 703.78 kB | Format: Adobe PDF | Access: restricted (not authorized for public access)