NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98874
Full metadata record (DC field / value / language):
dc.contributor.advisor: 劉邦鋒 (zh_TW)
dc.contributor.advisor: Pangfeng Liu (en)
dc.contributor.author: 江明彥 (zh_TW)
dc.contributor.author: Ming-Yen Chiang (en)
dc.date.accessioned: 2025-08-20T16:06:57Z
dc.date.available: 2025-08-21
dc.date.copyright: 2025-08-20
dc.date.issued: 2025
dc.date.submitted: 2025-08-13
dc.identifier.citation:
O. Beaumont, L. Eyraud-Dubois, J. Herrmann, A. Joly, and A. Shilova. Optimal re-materialization strategies for heterogeneous chains: How to train deep neural networks with limited memory. ACM Trans. Math. Softw., 50(2), June 2024.
T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
T. Chen, B. Xu, C. Zhang, and C. Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.
S. Fan, Y. Rong, C. Meng, Z. Cao, S. Wang, Z. Zheng, C. Wu, G. Long, J. Yang, L. Xia, et al. DAPPLE: A pipelined data parallel approach for training large models. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 431–445, 2021.
J. Feng and D. Huang. Optimal gradient checkpoint search for arbitrary computation graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11433–11442, 2021.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
D.-Y. Hong, T.-H. Tsai, N. Wang, P. Liu, and J.-J. Wu. GPU memory usage optimization for backward propagation in deep network training. Journal of Parallel and Distributed Computing, page 105053, 2025.
Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu, et al. GPipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019.
I. Jain, A. Jaiswal, T. Dettmers, F. Kjolstad, S. Khasnabish, and A. Grover. Reducing activation recomputation in large transformer models. In Proceedings of the 5th MLSys Conference, MLSys ’22. Association for Computing Machinery, 2022.
A. Kolesnikov, L. Beyer, X. Zhai, J. Puigcerver, J. Yung, S. Gelly, and N. Houlsby. Big transfer (bit): General visual representation learning, 2020.
D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N. R. Devanur, G. R. Ganger, P. B. Gibbons, and M. Zaharia. PipeDream: Generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM symposium on operating systems principles, pages 1–15, 2019.
D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, A. Phanishayee, and M. Zaharia. Efficient large-scale language model training on GPU clusters using Megatron-LM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’21, New York, NY, USA, 2021. Association for Computing Machinery.
A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro. Megatron-LM: training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.
K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
Z. Sun, H. Cao, Y. Wang, G. Feng, S. Chen, H. Wang, and W. Chen. AdaPipe: Optimizing pipeline parallelism with adaptive recomputation and partitioning. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pages 86–100, 2024.
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98874
dc.description.abstract (zh_TW): Training large deep neural networks requires a great deal of computing time and GPU memory. Pipeline parallelism is one solution: it splits the neural network model into several consecutive stages and assigns them to different GPUs, so that no single device has to store the entire model. Pipeline parallelism then divides each mini-batch of data into smaller micro-batches, so that all stages can process different micro-batches concurrently. However, the intermediate activations that each micro-batch produces during the forward pass must be kept until its backward pass completes, which puts enormous pressure on memory. Activation checkpointing effectively mitigates this problem: it stores only a subset of the activations during the forward pass and recomputes the rest during the backward pass, trading extra computation for memory savings.
We propose a two-step approach that, under the memory limit of each GPU, jointly determines the best pipeline partition and activation checkpointing configuration. First, for each candidate stage we select the best combination of checkpoints that minimizes execution time while keeping memory usage within the limit. Then, based on these stage costs, we use dynamic programming to find a balanced partition. We implement our algorithm, and experiments show that it improves training throughput by a factor of up to 1.23.
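The activation checkpointing described in the abstract is exposed in PyTorch through the torch.utils.checkpoint utility. The sketch below is only a generic illustration of that mechanism, not the thesis's checkpoint-selection algorithm; the CheckpointedChain module, its sizes, and its layer structure are made up for the example.

    # Minimal illustration of activation checkpointing with PyTorch's
    # torch.utils.checkpoint: each wrapped block keeps only its input during
    # the forward pass and recomputes its internal activations when the
    # backward pass reaches it, trading extra compute for lower peak memory.
    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    class CheckpointedChain(nn.Module):   # hypothetical example module
        def __init__(self, dim=1024, depth=8):
            super().__init__()
            self.blocks = nn.ModuleList(
                [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(depth)]
            )

        def forward(self, x):
            for block in self.blocks:
                # Checkpointing every block is the extreme setting; deciding
                # which blocks to checkpoint is the selection problem the
                # thesis optimizes under a memory budget.
                x = checkpoint(block, x, use_reentrant=False)
            return x

    model = CheckpointedChain()
    x = torch.randn(32, 1024, requires_grad=True)
    model(x).sum().backward()   # blocks are re-executed here to rebuild activations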
dc.description.abstract (en): Training large deep neural networks demands enormous computing time and GPU memory. Pipeline parallelism alleviates these demands by partitioning a neural network into sequential stages that reside on different GPUs, so that no single device has to store the full set of model parameters. Pipeline parallelism then divides each mini-batch into micro-batches, allowing all stages to work on different micro-batches concurrently. However, the intermediate activations for every micro-batch must be kept until its backward pass completes, creating severe memory pressure. Activation checkpointing mitigates this by caching only a subset of activations during the forward pass and recomputing the others during the backward pass, trading extra computation for memory savings.
We present a two-step approach that jointly determines the pipeline partition and the checkpoint set under per-GPU memory constraints. First, we select the optimal set of checkpoints for each candidate stage, minimizing its execution time while ensuring that its memory usage does not exceed the device's capacity. Second, given these per-stage costs, we use dynamic programming to find a balanced partition. We implement our algorithm, and experiments show up to a 1.23-fold increase in training throughput.
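The second step above, finding a balanced partition once each candidate stage's cost is known, can be sketched as a standard chain-partition dynamic program. This is an illustrative sketch under assumptions, not the recurrence from the thesis: the names balanced_partition, n_layers, n_stages, and stage_cost are hypothetical, stage_cost(i, j) is assumed to return the optimal checkpointed execution time of layers i..j-1 under the per-GPU memory budget (the output of the first step, or infinity if no feasible checkpoint set exists), and the objective is assumed to be minimizing the cost of the most loaded stage.

    # Illustrative balanced-partition DP (assumed objective: minimize the
    # bottleneck stage cost); stage_cost is a hypothetical callable supplied
    # by the checkpoint-selection step.
    from functools import lru_cache

    def balanced_partition(n_layers, n_stages, stage_cost):
        @lru_cache(maxsize=None)
        def best(j, s):
            # Cheapest possible bottleneck when the first j layers form s stages.
            if s == 1:
                return stage_cost(0, j)
            return min(max(best(i, s - 1), stage_cost(i, j))
                       for i in range(s - 1, j))

        # Recover the cut points by walking the recurrence backwards.
        cuts, j = [], n_layers
        for s in range(n_stages, 1, -1):
            i = min(range(s - 1, j),
                    key=lambda i: max(best(i, s - 1), stage_cost(i, j)))
            cuts.append(i)
            j = i
        return best(n_layers, n_stages), sorted(cuts)

    # Toy usage: 6 unit-cost layers on 3 GPUs -> bottleneck 2.0, cuts at layers 2 and 4.
    print(balanced_partition(6, 3, lambda i, j: float(j - i)))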
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-08-20T16:06:57Z. No. of bitstreams: 0 (en)
dc.description.provenance: Made available in DSpace on 2025-08-20T16:06:57Z (GMT). No. of bitstreams: 0 (en)
dc.description.tableofcontents:
Oral Examination Committee Certification i
Acknowledgements ii
Abstract (Chinese) iii
Abstract iv
Contents vi
List of Figures viii
Chapter 1 Introduction 1
Chapter 2 Related Work 4
2.1 Pipeline Parallelism 4
2.2 Activation Checkpointing 5
Chapter 3 Background 7
3.1 Fast-forward Serial Graph 7
3.2 Memory Model of PyTorch’s Checkpointing 8
3.3 Implementation of PyTorch’s Checkpointing on Fast-forward Serial Graph 10
Chapter 4 Problem Formulation 12
4.1 Optimal Checkpoint Selection of the Given Layers 12
4.2 Pipeline Partition 14
Chapter 5 Algorithm 15
5.1 Dynamic Programming for Optimal Checkpoint Selection Problem 15
5.2 Dynamic Programming for Pipeline Partition 19
Chapter 6 Experiment 20
6.1 Experimental Setup 20
6.1.1 Implementation Details 20
6.1.2 Hardware Configuration 20
6.1.3 Baseline 20
6.1.4 Models 21
6.2 Memory Model Prediction versus PyTorch Report 22
6.3 Analysis of Throughput 23
Chapter 7 Conclusion and Future Work 27
References 28
dc.language.iso: en
dc.subject: 深度學習 [Deep Learning] (zh_TW)
dc.subject: 管線平行化 [Pipeline Parallelism] (zh_TW)
dc.subject: 激活檢查點 [Activation Checkpointing] (zh_TW)
dc.subject: 動態規劃 [Dynamic Programming] (zh_TW)
dc.subject: Activation Checkpointing (en)
dc.subject: Deep Learning (en)
dc.subject: Dynamic Programming (en)
dc.subject: Pipeline Parallelism (en)
dc.title: 結合激活檢查點以優化深度學習之管線平行化 (zh_TW)
dc.title: Optimizing Pipeline Parallelism for Deep Learning with Activation Checkpointing (en)
dc.type: Thesis
dc.date.schoolyear: 113-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 洪鼎詠; 吳真貞 (zh_TW)
dc.contributor.oralexamcommittee: Ding-Yong Hong; Jan-Jan Wu (en)
dc.subject.keyword: 深度學習, 管線平行化, 激活檢查點, 動態規劃 (zh_TW)
dc.subject.keyword: Deep Learning, Pipeline Parallelism, Activation Checkpointing, Dynamic Programming (en)
dc.relation.page: 30
dc.identifier.doi: 10.6342/NTU202503990
dc.rights.note: 未授權 (not authorized for public access)
dc.date.accepted: 2025-08-15
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 資訊工程學系 (Department of Computer Science and Information Engineering)
dc.date.embargo-lift: N/A
Appears in Collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in This Item:
ntu-113-2.pdf (not authorized for public access), 587.63 kB, Adobe PDF


All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
