Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98874

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 劉邦鋒 | zh_TW |
| dc.contributor.advisor | Pangfeng Liu | en |
| dc.contributor.author | 江明彥 | zh_TW |
| dc.contributor.author | Ming-Yen Chiang | en |
| dc.date.accessioned | 2025-08-20T16:06:57Z | - |
| dc.date.available | 2025-08-21 | - |
| dc.date.copyright | 2025-08-20 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-08-13 | - |
| dc.identifier.citation | O. Beaumont, L. Eyraud-Dubois, J. Herrmann, A. Joly, and A. Shilova. Optimal re-materialization strategies for heterogeneous chains: How to train deep neural networks with limited memory. ACM Trans. Math. Softw., 50(2), June 2024. T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020. T. Chen, B. Xu, C. Zhang, and C. Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016. S. Fan, Y. Rong, C. Meng, Z. Cao, S. Wang, Z. Zheng, C. Wu, G. Long, J. Yang, L. Xia, et al. DAPPLE: A pipelined data parallel approach for training large models. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 431–445, 2021. J. Feng and D. Huang. Optimal gradient checkpoint search for arbitrary computation graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11433–11442, 2021. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016. D.-Y. Hong, T.-H. Tsai, N. Wang, P. Liu, and J.-J. Wu. GPU memory usage optimization for backward propagation in deep network training. Journal of Parallel and Distributed Computing, page 105053, 2025. Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu, et al. GPipe: Efficient training of giant neural networks using pipeline parallelism. Advances in Neural Information Processing Systems, 32, 2019. I. Jain, A. Jaiswal, T. Dettmers, F. Kjolstad, S. Khasnabish, and A. Grover. Reducing activation recomputation in large transformer models. In Proceedings of the 5th MLSys Conference, MLSys '22. Association for Computing Machinery, 2022. A. Kolesnikov, L. Beyer, X. Zhai, J. Puigcerver, J. Yung, S. Gelly, and N. Houlsby. Big Transfer (BiT): General visual representation learning, 2020. D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N. R. Devanur, G. R. Ganger, P. B. Gibbons, and M. Zaharia. PipeDream: Generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 1–15, 2019. D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, A. Phanishayee, and M. Zaharia. Efficient large-scale language model training on GPU clusters using Megatron-LM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '21, New York, NY, USA, 2021. Association for Computing Machinery. A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019. A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019. M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019. K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. Z. Sun, H. Cao, Y. Wang, G. Feng, S. Chen, H. Wang, and W. Chen. AdaPipe: Optimizing pipeline parallelism with adaptive recomputation and partitioning. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pages 86–100, 2024. H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98874 | - |
| dc.description.abstract | 訓練大型深度神經網路需要大量的運算時間與 GPU 記憶體。而管線平行化(Pipeline parallelism)是一種解決方案,它將神經網路模型切分成數個連續的階段,並分配到不同的 GPU 上,這樣就沒有單一裝置需要儲存整個模型。管線平行化接著將每個小批次(mini-batch)資料拆分為更小的微批次(micro-batches),讓所有階段能同步處理不同的微批次。然而,每個微批次在前向傳播中產生的中間激活值(intermediate activation)必須一直保留到其反向傳播完成,這會對記憶體造成極大壓力。激活檢查點(Activation checkpointing)能有效緩解此問題:它在前向傳播時僅儲存部分激活值,並在反向傳播時重新計算其餘激活值,以此用額外的計算量換取記憶體上的節省。我們提出了一種兩階段方法,能在每個 GPU 的記憶體限制下,同時決定最佳的管線切分與激活檢查點設定。首先,我們為每個候選階段選擇最佳的檢查點組合,以最小化執行時間,並確保記憶體用量不超出限制。接著,根據這些階段的成本,我們使用動態規劃(dynamic programming)來找出一個平衡的切分。我們的演算法經實作與實驗證明,可將訓練吞吐量提升高達 1.23 倍。 | zh_TW |
| dc.description.abstract | Training large deep neural networks demands enormous computing time and GPU memory. Pipeline parallelism alleviates these demands by partitioning a neural network into sequential stages that reside on different GPUs, so that no single device has to store the full set of model parameters. Pipeline parallelism then divides each mini-batch into micro-batches, allowing all stages to work on different micro-batches concurrently. However, the intermediate activations of every micro-batch must be kept until its backward pass completes, creating severe memory pressure. Activation checkpointing mitigates this by caching only a subset of activations during the forward pass and recomputing the others during the backward pass, trading extra computation for memory savings. We present a two-step approach that jointly determines the pipeline partition and the checkpoint set under per-GPU memory constraints. First, we select the optimal set of checkpoints that minimizes the execution time of each candidate stage while ensuring that it does not exceed the device's memory. Second, given these per-stage costs, we use dynamic programming to find a balanced partition. We implement our algorithm and conduct experiments, achieving up to a 1.23-fold increase in training throughput. (An illustrative sketch of this two-step approach appears after the metadata table below.) | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-08-20T16:06:57Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-08-20T16:06:57Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | 口試委員審定書 i 致謝 ii 摘要 iii Abstract iv Contents vi List of Figures viii Chapter 1 Introduction 1 Chapter 2 Related Work 4 2.1 Pipeline Parallelism 4 2.2 Activation Checkpointing 5 Chapter 3 Background 7 3.1 Fast-forward Serial Graph 7 3.2 Memory Model of PyTorch’s Checkpointing 8 3.3 Implementation of PyTorch’s Checkpointing on Fast-forward Serial Graph 10 Chapter 4 Problem Formulation 12 4.1 Optimal Checkpoint Selection of the Given Layers 12 4.2 Pipeline Partition 14 Chapter 5 Algorithm 15 5.1 Dynamic Programming for Optimal Checkpoint Selection Problem 15 5.2 Dynamic Programming for Pipeline Partition 19 Chapter 6 Experiment 20 6.1 Experimental Setup 20 6.1.1 Implementation Details 20 6.1.2 Hardware Configuration 20 6.1.3 Baseline 20 6.1.4 Models 21 6.2 Memory Model Prediction versus PyTorch Report 22 6.3 Analysis of Throughput 23 Chapter 7 Conclusion and Future Work 27 References 28 | - |
| dc.language.iso | en | - |
| dc.subject | 深度學習 | zh_TW |
| dc.subject | 管線平行化 | zh_TW |
| dc.subject | 激活檢查點 | zh_TW |
| dc.subject | 動態規劃 | zh_TW |
| dc.subject | Activation Checkpointing | en |
| dc.subject | Deep Learning | en |
| dc.subject | Dynamic Programming | en |
| dc.subject | Pipeline Parallelism | en |
| dc.title | 結合激活檢查點以優化深度學習之管線平行化 | zh_TW |
| dc.title | Optimizing Pipeline Parallelism for Deep Learning with Activation Checkpointing | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-2 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.oralexamcommittee | 洪鼎詠;吳真貞 | zh_TW |
| dc.contributor.oralexamcommittee | Ding-Yong Hong;Jan-Jan Wu | en |
| dc.subject.keyword | 深度學習,管線平行化,激活檢查點,動態規劃 | zh_TW |
| dc.subject.keyword | Deep Learning, Pipeline Parallelism, Activation Checkpointing, Dynamic Programming | en |
| dc.relation.page | 30 | - |
| dc.identifier.doi | 10.6342/NTU202503990 | - |
| dc.rights.note | 未授權 | - |
| dc.date.accepted | 2025-08-15 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 資訊工程學系 | - |
| dc.date.embargo-lift | N/A | - |
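
The abstract above describes a two-step search: for each candidate pipeline stage, choose a checkpoint set that minimizes the stage's execution time under a per-GPU memory limit, then use dynamic programming over stage boundaries to balance the pipeline. The following is a minimal, illustrative Python sketch of that idea under a deliberately simplified model; the per-layer costs, the memory budget, the per-layer store-vs-recompute choice, and the minimax partition objective are all assumptions made for illustration, not the thesis's actual formulation (which builds on fast-forward serial graphs and PyTorch's checkpointing memory model).

```python
# Illustrative sketch only: simplified cost/memory model, hypothetical numbers.
from functools import lru_cache

# Hypothetical per-layer costs: (forward time, activation memory), arbitrary units.
LAYERS = [(2.0, 4.0), (3.0, 6.0), (1.5, 3.0), (2.5, 5.0), (2.0, 4.0), (3.5, 7.0)]
MEM_LIMIT = 12.0   # assumed per-GPU activation memory budget
NUM_GPUS = 3       # assumed number of pipeline stages


def stage_cost(lo, hi):
    """Best time for layers [lo, hi) on one GPU under MEM_LIMIT.

    Simplified model: each layer is either checkpointed (activation discarded
    after the forward pass and recomputed in the backward pass, so its forward
    time is counted twice) or stored (memory counted, time counted once).
    Brute force over checkpoint sets for clarity; the thesis uses dynamic
    programming and a more detailed memory model.
    """
    layers = LAYERS[lo:hi]
    best = float("inf")
    for mask in range(1 << len(layers)):
        time = mem = 0.0
        for i, (t, m) in enumerate(layers):
            if mask >> i & 1:      # checkpointed: pay recomputation
                time += 2 * t
            else:                  # stored: pay memory
                time += t
                mem += m
        if mem <= MEM_LIMIT:
            best = min(best, time)
    return best                    # inf if no feasible checkpoint set exists


@lru_cache(maxsize=None)
def partition(lo, gpus):
    """Minimize the slowest stage's time for layers [lo, end) over `gpus` stages."""
    if gpus == 1:
        return stage_cost(lo, len(LAYERS)), ((lo, len(LAYERS)),)
    best = (float("inf"), ())
    for cut in range(lo + 1, len(LAYERS) - gpus + 2):   # leave >= 1 layer per stage
        head = stage_cost(lo, cut)
        tail, stages = partition(cut, gpus - 1)
        cand = max(head, tail)
        if cand < best[0]:
            best = (cand, ((lo, cut),) + stages)
    return best


if __name__ == "__main__":
    bottleneck, stages = partition(0, NUM_GPUS)
    print("stage boundaries:", stages, "bottleneck stage time:", bottleneck)
```

The minimax objective reflects that, in a pipeline, steady-state throughput is limited by the slowest stage; the per-stage inner search is what makes the partition and the checkpoint selection interact.
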
Appears in Collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in This Item:

| File | Size | Format |
|---|---|---|
| ntu-113-2.pdf (未授權公開取用, restricted access) | 587.63 kB | Adobe PDF |