Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98943

Full metadata record

| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 劉邦鋒 | zh_TW |
| dc.contributor.advisor | Pangfeng Liu | en |
| dc.contributor.author | 吳榮哲 | zh_TW |
| dc.contributor.author | Rong-Jhe Wu | en |
| dc.date.accessioned | 2025-08-20T16:22:48Z | - |
| dc.date.available | 2025-08-21 | - |
| dc.date.copyright | 2025-08-20 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-08-14 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98943 | - |
| dc.description.abstract | 深度神經網路已成為廣泛成功的框架,應用於多種領域。然而,現代應用越來越依賴更大型的模型以提升性能。參數數量的快速增長常導致訓練過程中出現記憶體瓶頸。一種有效的解決方案是激活檢查點(activation checkpointing),該方法只在前向傳播中保存部分中間激活值,並在反向傳播時重新計算這些激活值,以降低記憶體消耗。本文聚焦於在多GPU環境下訓練深度神經網路時,最小化記憶體使用。我們採用流水線並行(pipeline parallelism)將模型分割成較小的階段並分布於多個設備,並結合檢查點技術,在負載重的情況下進一步減少記憶體需求。我們的目標是找到能夠在大規模多 GPU 訓練過程中優化記憶體效率的檢查點策略。 | zh_TW |
| dc.description.abstract | Deep neural networks have become a widely successful framework, applied across a broad range of domains. However, modern use cases increasingly rely on larger models to achieve better performance. This rapid growth in the number of parameters often results in memory bottlenecks during training. An effective approach to mitigating this issue is activation checkpointing, which stores only a subset of intermediate activations during the forward pass and recomputes the rest during the backward pass to reduce memory consumption (a minimal illustrative sketch follows the metadata table below). In this thesis, we focus on minimizing memory usage when training deep neural networks across multiple GPUs. We employ pipeline parallelism to partition the model into smaller stages distributed across devices, and we apply checkpointing to further reduce memory demands under heavy workloads. Our goal is to identify checkpointing strategies that optimize memory efficiency during large-scale multi-GPU training. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-08-20T16:22:48Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-08-20T16:22:48Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Oral Defense Committee Certification i; Acknowledgements ii; Abstract (Chinese) iii; Abstract iv; Contents v; Chapter 1 Introduction 1; 1.1 Activation Checkpointing 2; 1.2 GPipe Pipeline Method 3; Chapter 2 Related Work 4; Chapter 3 Problem 5; 3.1 Memory Model for Training on a Single GPU 5; 3.2 Memory Model for Training Across Multiple GPUs 6; 3.3 Checkpoint Selection Problem with Multiple GPUs 7; Chapter 4 Algorithm 8; 4.1 Dynamic Programming Algorithm 8; 4.2 Time Complexity 9; Chapter 5 Conclusion 10; References 11 | - |
| dc.language.iso | en | - |
| dc.subject | 深度學習 | zh_TW |
| dc.subject | 管線平行化 | zh_TW |
| dc.subject | 激活檢查點 | zh_TW |
| dc.subject | 動態規劃 | zh_TW |
| dc.subject | Dynamic Programming | en |
| dc.subject | Activation Checkpointing | en |
| dc.subject | Deep Learning | en |
| dc.subject | Pipeline Parallelism | en |
| dc.title | 多圖形處理器上深度學習網路訓練的記憶體優化 | zh_TW |
| dc.title | Optimizing Memory Usage in Deep Network Training with Multiple GPUs | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-2 | - |
| dc.description.degree | Master's | - |
| dc.contributor.oralexamcommittee | 洪鼎詠;吳真貞 | zh_TW |
| dc.contributor.oralexamcommittee | Ding-Yong Hong;Jan-Jan Wu | en |
| dc.subject.keyword | 深度學習, 管線平行化, 激活檢查點, 動態規劃 | zh_TW |
| dc.subject.keyword | Deep Learning, Pipeline Parallelism, Activation Checkpointing, Dynamic Programming | en |
| dc.relation.page | 13 | - |
| dc.identifier.doi | 10.6342/NTU202504398 | - |
| dc.rights.note | Consent to release (open access worldwide) | - |
| dc.date.accepted | 2025-08-15 | - |
| dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
| dc.contributor.author-dept | Department of Computer Science and Information Engineering | - |
| dc.date.embargo-lift | 2025-08-21 | - |
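
The abstract above describes activation checkpointing only in prose. As a point of reference, the sketch below illustrates the idea with PyTorch's public `torch.utils.checkpoint` API; it is not the thesis's implementation, and the layer sizes, segment count, and batch size are assumptions chosen purely for illustration.

```python
# Minimal sketch of activation checkpointing (illustrative, not the thesis's code).
# The layer stack is grouped into segments; only the input to each segment is
# kept during the forward pass, and the activations inside a segment are
# recomputed during the backward pass, trading extra compute for lower peak memory.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Assumed toy model: a deep stack whose activations would otherwise dominate memory.
layers = [nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(16)]
model = nn.Sequential(*layers)

x = torch.randn(32, 1024, requires_grad=True)

# Run the 16 layers as 4 checkpointed segments.
out = checkpoint_sequential(model, 4, x, use_reentrant=False)
loss = out.sum()
loss.backward()
```

In the setting described in the abstract, checkpointing of this kind would be applied within each pipeline stage after the model has been partitioned across GPUs in the GPipe style; the thesis studies how to select the checkpoints so that memory usage is minimized.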
Appears in collections: Department of Computer Science and Information Engineering

Files in this item:

| File | Size | Format |
|---|---|---|
| ntu-113-2.pdf | 929.27 kB | Adobe PDF |