Please use this Handle URI to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88653
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 劉邦鋒 | zh_TW |
dc.contributor.advisor | Pangfeng Liu | en |
dc.contributor.author | 王甯 | zh_TW |
dc.contributor.author | Ning Wang | en |
dc.date.accessioned | 2023-08-15T17:13:59Z | - |
dc.date.available | 2023-11-09 | - |
dc.date.copyright | 2023-08-15 | - |
dc.date.issued | 2023 | - |
dc.date.submitted | 2023-08-07 | - |
dc.identifier.citation | O. Beaumont, L. Eyraud-Dubois, and A. Shilova, "Efficient combination of rematerialization and offloading for training DNNs," in Neural Information Processing Systems, 2021.
T. Chen, B. Xu, C. Zhang, and C. Guestrin, "Training deep nets with sublinear memory cost," 2016.
C.-C. Huang, G. Jin, and J. Li, "SwapAdvisor: Pushing deep learning beyond the GPU memory limit via smart swapping," in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 2020.
J. Feng and D. Huang, "Optimal gradient checkpoint search for arbitrary computation graphs," 2021.
A. N. Gomez, M. Ren, R. Urtasun, and R. B. Grosse, "The reversible residual network: Backpropagation without storing activations," in NIPS, 2017.
I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
A. Gruslys, R. Munos, I. Danihelka, M. Lanctot, and A. Graves, "Memory-efficient backpropagation through time," in NIPS, 2016.
S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in International Conference on Machine Learning, 2015.
S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," arXiv: Computer Vision and Pattern Recognition, 2015.
S. Han, J. Pool, J. Tran, and W. J. Dally, "Learning both weights and connections for efficient neural networks," in NIPS, 2015.
K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
J. Herrmann, O. Beaumont, L. Eyraud-Dubois, J. Hermann, A. Joly, and A. Shilova, "Optimal checkpointing for heterogeneous chains: How to train deep neural networks with limited memory," 2019.
J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, "Squeeze-and-excitation networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, pp. 2011–2023, 2017.
P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, N. E. Jerger, and A. Moshovos, "Proteus: Exploiting numerical precision variability in deep neural networks," in Proceedings of the 2016 International Conference on Supercomputing (ICS '16), Association for Computing Machinery, 2016. [Online]. Available: https://doi.org/10.1145/2925426.2926294
M. Kirisame, S. Lyubomirsky, A. Haan, J. Brennan, M. He, J. Roesch, T. Chen, and Z. Tatlock, "Dynamic tensor rematerialization," in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=Vfs_-2RnOD0H
A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017.
T. D. Le, H. Imai, Y. Negishi, and K. Kawachiya, "TFLMS: Large model support in TensorFlow by graph rewriting," arXiv, vol. abs/1807.02037, 2018.
A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "PyTorch: An imperative style, high-performance deep learning library," 2019.
B. Pudipeddi, M. Mesmakhosroshahi, J. Xi, and S. Bharadwaj, "Training large neural networks with constant memory using a new execution algorithm," arXiv, vol. abs/2002.05645, 2020.
M. Rhu, N. Gimelshein, J. Clemons, A. Zulfiqar, and S. W. Keckler, "vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design," in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–13, 2016.
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, pp. 211–252, 2015.
K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2014.
N. S. Sohoni, C. R. Aberger, M. Leszczynski, J. Zhang, and C. Ré, "Low-memory neural network training: A technical report," 2022.
C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI '17), AAAI Press, 2017, pp. 4278–4284.
M. Tan and Q. V. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," arXiv, vol. abs/1905.11946, 2019.
Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. R. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. S. Corrado, M. Hughes, and J. Dean, "Google's neural machine translation system: Bridging the gap between human and machine translation," arXiv, vol. abs/1609.08144, 2016.
Z. Wu, C. Shen, and A. van den Hengel, "High-performance semantic segmentation using very deep fully convolutional networks," arXiv, vol. abs/1604.04339, 2016.
J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and Y. Wu, "CoCa: Contrastive captioners are image-text foundation models," Transactions on Machine Learning Research, 2022.
S. Zagoruyko and N. Komodakis, "Wide residual networks," arXiv e-prints, arXiv:1605.07146, May 2016.
J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2242–2251.
B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8697–8710, 2018. | - |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88653 | - |
dc.description.abstract | 在現代深度學習中,設計更大的深度神經網路(DNN)以執行更複雜的任務並取得更高的準確性已成為一種趨勢。另一方面,卷積神經網路(CNN)已成為大多數電腦視覺任務的標準方法。然而,這些卷積層的中間資料所需的記憶體配置,可能會在模型訓練期間造成嚴重的記憶體壓力。許多解決方案已被提出來處理此問題。除了依賴硬體的解決方案之外,還有一種稱為以運算換取記憶體空間(rematerialization)的通用方法,可以藉由增加計算量來減少 GPU 記憶體的使用:它延遲前向傳播過程中部分層子集的激勵值計算以節省 GPU 記憶體,並在反向傳播階段批次重新計算它們。在這篇論文中,我們專注於有效率地找出最佳檢查點,以在模型訓練期間達到最小的記憶體峰值。我們首先描述訓練神經網路的理論背景與所用的數學方程,並使用這些方程確定在前向傳播與反向傳播過程中,計算模型權重梯度所必需的所有資料。接著我們定義檢查點選擇問題,並提出時間複雜度為 O(n^3) 的動態規劃演算法,解決尋找最佳檢查點子集的問題。透過大量實驗,我們以理論分析對問題做出更準確的描述,基於追蹤結果修正目標函數,並提出 O(n^2) 的動態規劃演算法來尋找最佳檢查點子集。 | zh_TW |
dc.description.abstract | In modern deep learning, it has become a trend to design larger Deep Neural Networks (DNNs) to execute more complex tasks with better accuracy. Convolutional Neural Networks (CNNs), meanwhile, have become the standard method for most computer vision tasks. However, the memory allocated for the intermediate data of their convolution layers can cause severe memory pressure during model training. Many solutions have been proposed to resolve this problem. Besides hardware-dependent solutions, a general methodology known as trading computation for memory, or rematerialization, reduces GPU memory usage at the cost of extra computation: it delays the computation of the activations of a subset of layers during the forward phase to save GPU memory and recomputes them in batch during the backward phase. In this thesis, we focus on efficiently finding the optimal checkpoint subset that achieves the least peak memory usage during model training. We first describe the theoretical background of neural network training using mathematical equations, and use these equations to identify all essential data required during both the forward and backward phases to compute the gradients of the model weights. We then formalize the checkpoint selection problem and propose a dynamic programming algorithm with time complexity O(n^3) for finding the optimal checkpoint subset. With extensive experiments, we formulate a more accurate description of the problem using our theoretical analysis, revise the objective function based on our memory traces, and propose an O(n^2) dynamic programming algorithm for finding the optimal checkpoint subset. | en |
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-08-15T17:13:59Z No. of bitstreams: 0 | en |
dc.description.provenance | Made available in DSpace on 2023-08-15T17:13:59Z (GMT). No. of bitstreams: 0 | en |
dc.description.tableofcontents | 口試委員審定書 (Committee Certification) i
致謝 (Acknowledgements) ii
摘要 (Chinese Abstract) iii
Abstract iv
Contents vi
List of Figures viii
List of Tables ix
Chapter 1 Introduction 1
1.1 The Memory Pressure Problem in Deep Learning 1
1.2 Analysis of Backward Propagation 4
Chapter 2 Related Work 8
2.1 Trading Computation for Memory 8
2.2 Finding Checkpoints Within a Given Budget Automatically 9
2.3 Finding The Optimal Checkpoints for Arbitrary DNN Model 10
2.4 Beyond Finding The Optimal Peak Memory Using The Checkpoint Technique 11
Chapter 3 Checkpoint Selection Problem 12
3.1 Problem Definition 12
3.2 Dynamic Programming 13
Chapter 4 PyTorch Implementation 16
4.1 Dynamic Programming 19
Chapter 5 Experiment 21
5.1 Environment Settings 21
5.1.1 Models and Definitions 21
5.1.2 Benchmarks 22
5.1.3 Implementation 22
5.2 Algorithm Prediction versus PyTorch Report 23
5.3 Comparison with O(√n) Memory Cost Algorithm 24
5.4 Comparison with Optimal Arbitrary Computation Graph Algorithm 25
5.5 Comparison: The Peak Memory Usage on Different Experiment Settings 26
5.6 The Granularity of Model Specs Matters: Testing The Optimal Checkpoints on AlexNet with Extended Model Specs 27
5.6.1 AlexNet: Using the Plain Specs 27
5.6.2 AlexNet: Standalone Max-Pooling Layers 28
Chapter 6 Conclusion 30
References 31 | - |
dc.language.iso | en | - |
dc.title | 深度網路訓練中反向傳播的GPU內存使用優化 | zh_TW |
dc.title | GPU Memory Usage Optimization for Backward Propagation in Deep Network Training | en |
dc.type | Thesis | - |
dc.date.schoolyear | 111-2 | - |
dc.description.degree | 碩士 (Master's) | - |
dc.contributor.oralexamcommittee | 吳真貞;洪鼎詠 | zh_TW |
dc.contributor.oralexamcommittee | Jan-Jan Wu;Ding-Yong Hong | en |
dc.subject.keyword | 深度學習,動態規劃,記憶體優化,記憶體壓力,檢查點 | zh_TW |
dc.subject.keyword | Deep Learning, Dynamic Programming, Memory Usage Optimization, Memory Pressure, Checkpointing | en |
dc.relation.page | 35 | - |
dc.identifier.doi | 10.6342/NTU202302572 | - |
dc.rights.note | 未授權 (Not authorized for public access) | - |
dc.date.accepted | 2023-08-08 | - |
dc.contributor.author-college | 電機資訊學院 (College of Electrical Engineering and Computer Science) | - |
dc.contributor.author-dept | 資訊工程學系 (Department of Computer Science and Information Engineering) | - |
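The abstract above describes the general "trading computation for memory" (gradient checkpointing, or rematerialization) technique. As a minimal sketch of that mechanism only — not the thesis's own algorithm or code — PyTorch exposes it through `torch.utils.checkpoint`; the example below assumes a recent PyTorch (2.x) and a toy convolutional chain:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A toy convolutional chain standing in for a real CNN.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
)
x = torch.randn(8, 3, 224, 224, requires_grad=True)

# Split the chain into 2 segments: only the segment-boundary activations
# (the checkpoints) are kept during the forward pass; each segment's
# interior activations are recomputed during the backward pass.
out = checkpoint_sequential(model, 2, x, use_reentrant=False)
out.sum().backward()
```

To make the checkpoint selection problem concrete, the toy model below uses a deliberately simplified objective (a hypothetical assumption, not the thesis's formulation): for a linear chain with per-layer activation sizes, peak memory is approximated as the total size of stored checkpoints plus the heaviest recomputed segment. The exhaustive search here is exponential and purely illustrative; per the abstract, the thesis solves its precisely formulated version of this problem in O(n^2) time with dynamic programming.

```python
from itertools import combinations

def peak_memory(sizes, ckpts):
    # Stored checkpoints persist through the whole backward pass; an
    # unstored segment is recomputed in one batch, so its activations
    # are live together. Peak = stored total + heaviest segment.
    stored = sum(sizes[i] for i in ckpts)
    heaviest = seg = 0
    for i, m in enumerate(sizes):
        if i in ckpts:
            seg = 0
        else:
            seg += m
            heaviest = max(heaviest, seg)
    return stored + heaviest

def best_checkpoints(sizes):
    # Brute force over all checkpoint subsets (exponential, for intuition only).
    n = len(sizes)
    best_peak, best_set = peak_memory(sizes, frozenset()), ()
    for k in range(1, n + 1):
        for subset in combinations(range(n), k):
            p = peak_memory(sizes, frozenset(subset))
            if p < best_peak:
                best_peak, best_set = p, subset
    return best_peak, list(best_set)

# Example: a 6-layer chain with uneven activation sizes.
print(best_checkpoints([4, 1, 6, 2, 5, 3]))
```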
Appears in Collections: | 資訊工程學系 (Department of Computer Science and Information Engineering)
Files in This Item:
File | Size | Format |
---|---|---|
ntu-111-2.pdf (currently not authorized for public access) | 997.3 kB | Adobe PDF |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.