Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97080
| Title: | Optimizing Computation Scheduling for Efficient Large-Scale Model Training on Commodity GPU Servers (針對通用圖形處理器伺服器的大規模模型高效訓練之計算排程最佳化) |
| Author: | 許程翔 Cheng-Hsiang Hsu |
| Advisor: | 楊佳玲 Chia-Lin Yang |
| Keywords: | deep learning training, high performance computing, distributed training, pipeline parallelism, memory management |
| Year of publication: | 2025 |
| Degree: | Master's |
| Abstract: | Large-scale model training on commodity GPU servers typically relies on pipeline parallelism to distribute a model across multiple GPUs and reduce the peak memory usage of each GPU. However, when the model size exceeds the combined memory capacity of the GPUs, training still runs into memory capacity limits; using host memory as external memory is an effective way to overcome them. Yet even when state-of-the-art swapping techniques are applied to pipeline parallelism, model parameters still cannot be transferred from host memory to the GPUs in time, which results in I/O delays and pipeline bubbles that significantly degrade training performance. To alleviate this degradation, this research proposes a mechanism that dynamically schedules queued computations into the idle intervals introduced by swapping operations. By overlapping computation with data transfer, the mechanism reduces the idle time of compute resources and improves overall resource utilization. Experimental results demonstrate training throughput improvements of up to 1.18× over the baseline. |
| URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97080 |
| DOI: | 10.6342/NTU202500469 |
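| Full-text authorization: | Not authorized |
| Electronic full-text release date: | N/A |
| Appears in collections: | Department of Computer Science and Information Engineering |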
Files in this item:
| File | Size | Format | |
|---|---|---|---|
| ntu-113-1.pdf (not authorized for public access) | 1.13 MB | Adobe PDF |
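Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.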
