Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98874

| Title: | Optimizing Pipeline Parallelism for Deep Learning with Activation Checkpointing |
| Author: | 江明彥 Ming-Yen Chiang |
| Advisor: | 劉邦鋒 Pangfeng Liu |
| Keywords: | Deep Learning, Pipeline Parallelism, Activation Checkpointing, Dynamic Programming |
| Publication Year: | 2025 |
| Degree: | Master's |
| Abstract: | Training large deep neural networks demands enormous computing time and GPU memory. Pipeline parallelism alleviates these demands by partitioning a neural network into sequential stages that reside on different GPUs, so that no single device has to store the full set of model parameters. Pipeline parallelism then divides each mini-batch into micro-batches, allowing all stages to work on different micro-batches concurrently. However, the intermediate activations of every micro-batch must be kept until its backward pass completes, creating severe memory pressure. Activation checkpointing mitigates this by caching only a subset of activations during the forward pass and recomputing the others during the backward pass, trading extra computation for memory savings. We present a two-step approach that jointly determines the pipeline partition and the checkpoint set under per-GPU memory constraints. First, we select the optimal set of checkpoints that minimizes the execution time of each candidate stage while ensuring that it does not exceed the device's memory. Second, given these per-stage costs, we use dynamic programming to find a balanced partition (an illustrative sketch of this step appears below). We implement our algorithm and conduct experiments, achieving up to a 1.23-fold increase in training throughput. |
| URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98874 |
| DOI: | 10.6342/NTU202503990 |
| Full-Text License: | Not authorized |
| Electronic Full-Text Release Date: | N/A |
| Appears in Collections: | Department of Computer Science and Information Engineering |
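
The abstract's second step, finding a balanced pipeline partition by dynamic programming once per-stage costs are known, can be illustrated with a short sketch. The code below is not the thesis implementation; it assumes a hypothetical `stage_cost(start, end)` oracle standing in for the first step, returning the execution time of a candidate stage holding layers `start` to `end - 1` under its best checkpoint configuration (or infinity if no configuration fits in GPU memory), and it minimizes the slowest stage's cost as a proxy for balance.

```python
import math
from functools import lru_cache

def balanced_partition(num_layers, num_stages, stage_cost):
    """Split layers [0, num_layers) into contiguous stages so that the
    slowest stage (the pipeline bottleneck) is as fast as possible.

    `stage_cost(start, end)` is assumed to return the running time of a
    stage holding layers start..end-1 with its best checkpoint choice,
    or math.inf if no checkpoint choice fits in the device's memory.
    """

    @lru_cache(maxsize=None)
    def best(start, stages_left):
        # Minimal bottleneck cost for layers [start, num_layers) split
        # into exactly `stages_left` stages, plus the stage boundaries.
        if stages_left == 1:
            return stage_cost(start, num_layers), (num_layers,)
        best_cost, best_cuts = math.inf, None
        # Try every possible end for the next stage, leaving at least
        # one layer for each of the remaining stages.
        for end in range(start + 1, num_layers - stages_left + 2):
            head = stage_cost(start, end)
            tail, cuts = best(end, stages_left - 1)
            bottleneck = max(head, tail)
            if bottleneck < best_cost:
                best_cost, best_cuts = bottleneck, (end,) + cuts
        return best_cost, best_cuts

    return best(0, num_stages)

# Toy usage: 8 identical layers (1 time unit each) split over 3 GPUs.
if __name__ == "__main__":
    cost, cuts = balanced_partition(8, 3, lambda i, j: float(j - i))
    print(cost, cuts)  # bottleneck 3.0; `cuts` lists each stage's end index
```

In a full system the oracle itself would be the product of the first step, i.e. the per-stage checkpoint search under the memory limit; here it is only a placeholder.
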
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-113-2.pdf (restricted; not authorized for public access) | 587.63 kB | Adobe PDF |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
