Please use this Handle URI to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97080

Full metadata record

DC Field | Value | Language
dc.contributor.advisor | 楊佳玲 | zh_TW
dc.contributor.advisor | Chia-Lin Yang | en
dc.contributor.author | 許程翔 | zh_TW
dc.contributor.author | Cheng-Hsiang Hsu | en
dc.date.accessioned | 2025-02-26T16:21:27Z | -
dc.date.available | 2025-02-27 | -
dc.date.copyright | 2025-02-26 | -
dc.date.issued | 2025 | -
dc.date.submitted | 2025-02-11 | -
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97080 | -
dc.description.abstract | 在大型模型的訓練中,商品GPU伺服器(commodity GPU servers)通常依賴管線平行化(pipeline parallelism)將模型分散到多個GPU上,以降低單一GPU的記憶體峰值使用量。然而,當模型規模超過多個GPU的總記憶體容量時,模型訓練仍會面臨記憶體容量不足的問題,此時使用主記憶體作為外部記憶體是一種有效的解決方案。但即使將最先進的置換(swapping)技術應用於管線平行化,模型參數仍可能無法及時從主記憶體傳輸到GPU,進而導致I/O延遲與管線氣泡(pipeline bubbles),嚴重影響訓練效能。為了解決此問題,本研究提出一種新的機制,能夠將佇列中的計算動態地安排到置換操作所產生的閒置時段中。透過重疊計算與資料傳輸,此機制有效減少計算資源的閒置時間並提升整體資源利用率。實驗結果顯示,所提出的機制可使訓練吞吐量最高提升1.18倍。 | zh_TW
dc.description.abstract | Large-scale model training on commodity GPU servers relies on pipeline parallelism to distribute models across multiple GPUs and reduce peak per-GPU memory usage. However, when the model size exceeds the combined memory capacity of the GPUs, leveraging host memory as external memory is an effective way to overcome this limitation. Yet even when state-of-the-art swapping techniques are applied to pipeline parallelism, model parameters may still fail to transfer from host memory to the GPUs in time, resulting in I/O delays and pipeline bubbles that significantly degrade training performance. To alleviate this degradation, this research proposes a mechanism that dynamically schedules queued computations into the idle intervals introduced by swapping operations. By overlapping computation with data transfer, the mechanism reduces idle time and improves overall resource utilization. Experimental results demonstrate throughput improvements of up to 1.18× over the baseline. | en
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-02-26T16:21:26Z. No. of bitstreams: 0 | en
dc.description.provenance | Made available in DSpace on 2025-02-26T16:21:27Z (GMT). No. of bitstreams: 0 | en
dc.description.tableofcontents | Oral Examination Committee Approval i
Acknowledgements ii
Chinese Abstract iii
English Abstract iv
Table of Contents v
List of Figures vii
Chapter 1 Introduction 1
Chapter 2 Background 4
2.1 Pipeline Parallelism 4
2.2 Swapping Mechanism 6
Chapter 3 Observation 8
Chapter 4 Methodology 13
4.1 Bubbles in the Interleaved-1F1B 13
4.2 Assign Computations 14
Chapter 5 Evaluation 17
Chapter 6 Related Work 20
6.1 Tensor Movement and Swapping in Heterogeneous Memory Systems 20
6.2 Offloading and Distributed Memory Optimization 21
6.3 Distributed Training 22
6.3.1 Data Parallelism 22
6.3.2 Tensor Parallelism 23
6.3.3 Pipeline Parallelism 23
6.3.4 Sequence Parallelism 24
6.4 Mixed-precision Training and Quantization 25
Chapter 7 Conclusion 27
References 28 | -
dc.language.iso | en | -
dc.subject | 深度學習訓練 | zh_TW
dc.subject | 記憶體管理 | zh_TW
dc.subject | 管線平行化 | zh_TW
dc.subject | 分散式訓練 | zh_TW
dc.subject | 高效能運算 | zh_TW
dc.subject | deep learning training | en
dc.subject | memory management | en
dc.subject | distributed training | en
dc.subject | pipeline parallelism | en
dc.subject | high performance computing | en
dc.title | 針對通用圖形處理器伺服器的大規模模型高效訓練之計算排程最佳化 | zh_TW
dc.title | Optimizing Computation Scheduling for Efficient Large-Scale Model Training on Commodity GPU Servers | en
dc.type | Thesis | -
dc.date.schoolyear | 113-1 | -
dc.description.degree | 碩士 | -
dc.contributor.oralexamcommittee | 張原豪;鄭湘筠 | zh_TW
dc.contributor.oralexamcommittee | Yuan-Hao Chang;Hsiang-Yun Cheng | en
dc.subject.keyword | 深度學習訓練,高效能運算,分散式訓練,管線平行化,記憶體管理 | zh_TW
dc.subject.keyword | deep learning training,high performance computing,distributed training,pipeline parallelism,memory management | en
dc.relation.page | 33 | -
dc.identifier.doi | 10.6342/NTU202500469 | -
dc.rights.note | 未授權 | -
dc.date.accepted | 2025-02-12 | -
dc.contributor.author-college | 電機資訊學院 | -
dc.contributor.author-dept | 資訊工程學系 | -
dc.date.embargo-lift | N/A | -
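
The abstract above describes overlapping queued computation with host-to-GPU parameter swapping so that transfer time fills what would otherwise become pipeline bubbles. The code below is only a minimal illustrative sketch of that general idea, not the thesis's actual mechanism; it assumes PyTorch on a CUDA device, and the helper names (prefetch_params, run_stage) are hypothetical.

```python
import torch

def prefetch_params(cpu_params, device, copy_stream):
    # Issue asynchronous host-to-device copies on a dedicated stream so they
    # can overlap with computation running on the default stream.
    gpu_params = []
    with torch.cuda.stream(copy_stream):
        for p in cpu_params:
            gpu_params.append(p.to(device, non_blocking=True))
    return gpu_params

def run_stage(compute_fn, inputs, cpu_params_next, device):
    copy_stream = torch.cuda.Stream(device=device)
    # Start swapping in the next stage's parameters while this stage computes.
    next_params = prefetch_params(cpu_params_next, device, copy_stream)
    out = compute_fn(inputs)  # computation fills the would-be idle interval
    # Ensure the copies have finished before the fetched parameters are used.
    torch.cuda.current_stream(device).wait_stream(copy_stream)
    return out, next_params

if __name__ == "__main__":
    dev = torch.device("cuda")
    # Pinned host memory is needed for truly asynchronous H2D copies.
    cpu_params_next = [torch.randn(1024, 1024).pin_memory() for _ in range(4)]
    x = torch.randn(1024, 1024, device=dev)
    out, fetched = run_stage(lambda t: t @ t, x, cpu_params_next, dev)
    torch.cuda.synchronize()
    print(out.shape, len(fetched))
```

In this sketch, the dedicated copy stream plus pinned host buffers are what let the host-to-device transfer proceed concurrently with the matrix multiply on the default stream, which is the same computation/transfer overlap the abstract attributes its throughput gains to.
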
Appears in Collections: 資訊工程學系

Files in This Item:
File | Size | Format
ntu-113-1.pdf (restricted access) | 1.13 MB | Adobe PDF


Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
