Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/22111
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 劉邦鋒(Pangfeng Liu) | |
dc.contributor.author | Cing-Fu Jhu | en |
dc.contributor.author | 朱清福 | zh_TW |
dc.date.accessioned | 2021-06-08T04:03:06Z | - |
dc.date.copyright | 2018-08-03 | |
dc.date.issued | 2018 | |
dc.date.submitted | 2018-08-03 | |
dc.identifier.citation | [1] Backpropagation. https://en.wikipedia.org/wiki/Backpropagation.
[2] cuBLAS. https://developer.nvidia.com/cublas.
[3] CUDA unified memory. https://devblogs.nvidia.com/unified-memory-in-cuda-6/.
[4] cuDNN. https://developer.nvidia.com/cudnn.
[5] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. J. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Józefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. G. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. A. Tucker, V. Vanhoucke, V. Vasudevan, F. B. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. CoRR, abs/1603.04467, 2016.
[6] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR, abs/1512.01274, 2015.
[7] T. Chen, B. Xu, C. Zhang, and C. Guestrin. Training deep nets with sublinear memory cost. CoRR, abs/1604.06174, 2016.
[8] T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman. Project Adam: Building an efficient and scalable deep learning training system. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI '14, pages 571–582, Berkeley, CA, USA, 2014. USENIX Association.
[9] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. P. Kuksa. Natural language processing (almost) from scratch. CoRR, abs/1103.0398, 2011.
[10] H. Cui, H. Zhang, G. R. Ganger, P. B. Gibbons, and E. P. Xing. GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server. In Proceedings of the Eleventh European Conference on Computer Systems, EuroSys '16, pages 4:1–4:16, New York, NY, USA, 2016. ACM.
[11] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng. Large scale distributed deep networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1223–1231. Curran Associates, Inc., 2012.
[12] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, June 2009.
[13] C. dos Santos and M. Gatti. Deep convolutional neural networks for sentiment analysis of short texts. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 69–78. Dublin City University and Association for Computational Linguistics, 2014.
[14] I. Gelado, J. E. Stone, J. Cabezas, S. Patel, N. Navarro, and W.-m. W. Hwu. An asymmetric distributed shared memory model for heterogeneous parallel systems. SIGPLAN Not., 45(3):347–358, Mar. 2010.
[15] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM networks. In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, volume 4, pages 2047–2052, July 2005.
[16] S. Han, J. Pool, J. Tran, and W. J. Dally. Learning both weights and connections for efficient neural networks. CoRR, abs/1506.02626, 2015.
[17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[18] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. CoRR, abs/1408.5093, 2014.
[19] P. Judd, J. Albericio, T. H. Hetherington, T. M. Aamodt, N. D. E. Jerger, R. Urtasun, and A. Moshovos. Reduced-precision strategies for bounded memory in deep neural nets. CoRR, abs/1511.05236, 2015.
[20] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. In Neurocomputing: Foundations of Research, pages 696–699. MIT Press, Cambridge, MA, USA, 1988.
[21] X. Sierra-Canto, F. Madera-Ramirez, and V. Uc-Cetina. Parallel training of a back-propagation neural network using CUDA. In Proceedings of the 2010 Ninth International Conference on Machine Learning and Applications, ICMLA '10, pages 307–312, Washington, DC, USA, 2010. IEEE Computer Society.
[22] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[23] C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, Inception-ResNet and the impact of residual connections on learning. CoRR, abs/1602.07261, 2016.
[24] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
[25] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 1701–1708, June 2014. | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/22111 | - |
dc.description.abstract | 許多大型的深層人工神經網絡模型在近幾年內已經被提出以實現更精確的訓練效果。這些巨大網路模型的訓練依賴大量的記憶體空間與通訊量,而這些都是在提高深度學習的效能方面具有挑戰性的問題。在本文中,我們分析了訓練深層神經網絡的資料存取模式,並提出一個減少圖形處理器上資料使用量與圖形處理器和主機之間的資料搬動量的資料固定演算法。我們證明了尋找一個最佳資料搬移的排程是NP完全的,並提出了一個可以找到最佳解的偽多項式時間的動態規劃演算法。亦即,我們觀察了深層神經網絡訓練的存取模式,並提出圖形處理器上專門的資料固定演算法,以最小化不必要的資料搬移。我們接著實施動態規劃,以訓練真正的深度學習模型。實驗證實,與 GeePS (一個先進的深度學習框架)相比,我們可以固定多20%的資料到圖形處理器的記憶體之中。此外,我們還提出了一個用於深度學習反向傳播計算的記憶需求減少技術。我們分析了深度學習反向傳播計算的存取模式,並認識到梯度計算與權重更新這兩個依序完成的主要步驟可以部分重疊。此外,我們分析了計算的語義,並意識到通過延遲權重更新,我們可以避免傳統的平行實作中由於讀寫衝突所需要的雙緩衝。我們接著實現我們的技術,並將參數梯度所需的記憶體使用率降低了75%。 | zh_TW |
dc.description.abstract | Many large deep neural network models have been proposed in recent years to achieve more accurate training results. Training these large models requires a huge amount of memory and communication, which has become a challenging issue in improving the performance of deep learning. In this paper, we analyze the data access pattern of training a deep neural network and propose a data pinning algorithm that reduces both the data usage on a GPU and the data movement between a GPU and its CPU host. We show that finding an optimal data movement schedule is NP-complete, and propose a dynamic programming algorithm that finds the optimal solution in pseudo-polynomial time. That is, we observe the access pattern of deep neural network training and propose a specialized GPU data pinning algorithm that minimizes unnecessary data movement. We then implement our dynamic programming to train real deep learning models. The experiments show that we can pin up to 20% more data into GPU memory than GeePS, a state-of-the-art deep learning framework. We also propose a memory reduction technique for back-propagation in deep learning. We analyze the access pattern of back-propagation and observe that gradient computation and weight update, two major steps that are traditionally performed sequentially, can be partially overlapped. In addition, we analyze the semantics of the computation and show that, by delaying the weight update, we can avoid the double buffering that read/write conflicts require in a traditional naive parallel implementation. We then implement our techniques and reduce the memory required for parameter gradients by up to 75%. | en |
dc.description.provenance | Made available in DSpace on 2021-06-08T04:03:06Z (GMT). No. of bitstreams: 1 ntu-107-R05922163-1.pdf: 1345212 bytes, checksum: 8276ce2da897793b81c83795ec9b1fef (MD5) Previous issue date: 2018 | en |
dc.description.tableofcontents | Thesis Oral Defense Committee Certification i
Acknowledgements ii
Chinese Abstract iii
Abstract iv
1 Introduction 1
2 Deep Learning 4
2.1 Deep Learning Algorithm 4
2.2 Forward Pass 4
2.3 Back-propagation 5
2.4 Weight Update 6
2.5 Deep Learning using GPU 6
3 Related Works 7
3.1 Model Parallelism 7
3.2 Memory Management 7
3.3 Memory Cost Reduction 8
3.4 Optimizations for Training Deep Neural Networks 9
4 GPU Memory Pinning 10
4.1 Problem Description 10
4.2 Problem Definition 12
4.3 NP-Completeness 13
4.4 Dynamic Programming 14
5 Backward Propagation Optimization 16
6 Experiments 19
6.1 Experiment Setup 19
6.2 Dataset and Models 19
6.3 Experiment Results 21
6.3.1 Batch Size 21
6.3.2 GPU Memory Pinning 22
6.3.3 Concurrent Update 22
7 Conclusion 25
Bibliography 26 | |
dc.language.iso | en | |
dc.title | 針對圖形處理器上的深度學習反向傳播計算之排程優化 | zh_TW |
dc.title | Scheduling Optimization of Backpropagation for Deep Learning on GPU | en |
dc.type | Thesis | |
dc.date.schoolyear | 106-2 | |
dc.description.degree | Master | |
dc.contributor.oralexamcommittee | 吳真貞,徐慰中 | |
dc.subject.keyword | 深度學習, 反向傳播, 動態規劃, 記憶體優化 | zh_TW |
dc.subject.keyword | Deep Learning, Back-propagation, Dynamic Programming, Memory Optimization | en |
dc.relation.page | 29 | |
dc.identifier.doi | 10.6342/NTU201802320 | |
dc.rights.note | Not authorized | |
dc.date.accepted | 2018-08-03 | |
dc.contributor.author-college | 電機資訊學院 | zh_TW |
dc.contributor.author-dept | 資訊工程學研究所 | zh_TW |
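The delayed-update idea described in the abstract can be illustrated with a short sketch. This is an illustrative reconstruction under assumed names and shapes (`layers`, `grads`, `delta`, a plain SGD rule on small dense layers), not the thesis's implementation: in back-propagation, the error `delta` for layer i-1 is computed by reading layer i's weights, so a naive concurrent update of those weights would need a second buffer; delaying each layer's in-place update until after its weights have been read avoids that copy.

```python
import numpy as np

rng = np.random.default_rng(0)
layers = [rng.standard_normal((4, 4)) for _ in range(3)]       # weights W_1..W_3
activations = [rng.standard_normal(4) for _ in range(3)]       # saved forward activations
grads = [None] * 3
delta = rng.standard_normal(4)                                 # error signal at the output
lr = 0.01

# Backward pass: for each layer, compute its weight gradient, THEN propagate
# delta through W_i (a read of W_i), and only after that read is W_i free to
# be updated in place without a second weight buffer.
for i in reversed(range(3)):
    grads[i] = np.outer(delta, activations[i])   # gradient w.r.t. W_i
    delta = layers[i].T @ delta                  # reads W_i: must precede the update
    layers[i] -= lr * grads[i]                   # delayed in-place update, no double buffer
```

Updating `layers[i]` any earlier would let the write race with the read in the `delta` propagation, which is exactly the read/write conflict that would otherwise force double buffering.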
Appears in Collections: | Department of Computer Science and Information Engineering
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-107-1.pdf (currently not authorized for public access) | 1.31 MB | Adobe PDF |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated in their license terms.