Please use this Handle URI to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97080

Full metadata record

DC Field | Value | Language
dc.contributor.advisor | 楊佳玲 | zh_TW
dc.contributor.advisor | Chia-Lin Yang | en
dc.contributor.author | 許程翔 | zh_TW
dc.contributor.author | Cheng-Hsiang Hsu | en
dc.date.accessioned | 2025-02-26T16:21:27Z | -
dc.date.available | 2025-02-27 | -
dc.date.copyright | 2025-02-26 | -
dc.date.issued | 2025 | -
dc.date.submitted | 2025-02-11 | -
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97080 | -
dc.description.abstract | 在大型模型的訓練中,商品GPU伺服器(commodity GPU servers)通常依賴管線平行化(pipeline parallelism)將模型分散到多個GPU上,以降低單一GPU的記憶體峰值使用量。然而,當模型規模超過多個GPU的總記憶體容量時,模型訓練仍會面臨記憶體容量不足的問題,此時使用主記憶體作為外部記憶體是一種有效的解決方案。但即使將最先進的置換(swapping)技術應用於管線平行化,模型參數仍可能無法及時從主記憶體傳輸到GPU,進而導致I/O延遲與管線氣泡(pipeline bubbles),嚴重影響訓練效能。為了解決此問題,本研究提出一種新的機制,能夠將佇列中的計算動態地安排到置換操作所產生的閒置時段中。透過重疊計算與資料傳輸,此機制有效減少計算資源的閒置時間並提升整體資源利用率。實驗結果顯示,所提出的機制可使訓練吞吐量最高提升1.18倍。 | zh_TW
dc.description.abstract | Large-scale model training on commodity GPU servers relies on pipeline parallelism to distribute models across multiple GPUs and reduce peak per-GPU memory usage. However, when the model size exceeds the combined memory capacity of the GPUs, leveraging host memory as external memory is an effective way to overcome this limitation. Yet even when state-of-the-art swapping techniques are applied to pipeline parallelism, model parameters may still fail to transfer from host memory to the GPUs in time, resulting in I/O delays and pipeline bubbles that significantly degrade training performance. To alleviate this degradation, this research proposes a mechanism that dynamically schedules queued computations into the idle intervals introduced by swapping operations. By overlapping computation with data transfer, the mechanism reduces idle time and improves overall resource utilization. Experimental results demonstrate throughput improvements of up to 1.18× over the baseline. | en
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-02-26T16:21:26Z. No. of bitstreams: 0 | en
dc.description.provenance | Made available in DSpace on 2025-02-26T16:21:27Z (GMT). No. of bitstreams: 0 | en
dc.description.tableofcontents | Oral Examination Committee Approval i
Acknowledgements ii
Chinese Abstract iii
English Abstract iv
Table of Contents v
List of Figures vii
Chapter 1 Introduction 1
Chapter 2 Background 4
2.1 Pipeline Parallelism 4
2.2 Swapping Mechanism 6
Chapter 3 Observation 8
Chapter 4 Methodology 13
4.1 Bubbles in the Interleaved-1F1B 13
4.2 Assign Computations 14
Chapter 5 Evaluation 17
Chapter 6 Related Work 20
6.1 Tensor Movement and Swapping in Heterogeneous Memory Systems 20
6.2 Offloading and Distributed Memory Optimization 21
6.3 Distributed Training 22
6.3.1 Data Parallelism 22
6.3.2 Tensor Parallelism 23
6.3.3 Pipeline Parallelism 23
6.3.4 Sequence Parallelism 24
6.4 Mixed-precision Training and Quantization 25
Chapter 7 Conclusion 27
References 28 | -
dc.language.iso | en | -
dc.subject | 深度學習訓練 | zh_TW
dc.subject | 記憶體管理 | zh_TW
dc.subject | 管線平行化 | zh_TW
dc.subject | 分散式訓練 | zh_TW
dc.subject | 高效能運算 | zh_TW
dc.subject | deep learning training | en
dc.subject | memory management | en
dc.subject | distributed training | en
dc.subject | pipeline parallelism | en
dc.subject | high performance computing | en
dc.title | 針對通用圖形處理器伺服器的大規模模型高效訓練之計算排程最佳化 | zh_TW
dc.title | Optimizing Computation Scheduling for Efficient Large-Scale Model Training on Commodity GPU Servers | en
dc.type | Thesis | -
dc.date.schoolyear | 113-1 | -
dc.description.degree | 碩士 | -
dc.contributor.oralexamcommittee | 張原豪;鄭湘筠 | zh_TW
dc.contributor.oralexamcommittee | Yuan-Hao Chang;Hsiang-Yun Cheng | en
dc.subject.keyword | 深度學習訓練,高效能運算,分散式訓練,管線平行化,記憶體管理 | zh_TW
dc.subject.keyword | deep learning training,high performance computing,distributed training,pipeline parallelism,memory management | en
dc.relation.page | 33 | -
dc.identifier.doi | 10.6342/NTU202500469 | -
dc.rights.note | 未授權 | -
dc.date.accepted | 2025-02-12 | -
dc.contributor.author-college | 電機資訊學院 | -
dc.contributor.author-dept | 資訊工程學系 | -
dc.date.embargo-lift | N/A | -
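
The abstract above describes overlapping queued computation with host-to-GPU parameter swapping so that transfer time fills what would otherwise become pipeline bubbles. The code below is only a minimal illustrative sketch of that general idea, not the thesis's actual mechanism; it assumes PyTorch on a CUDA device, and the helper names (prefetch_params, run_stage) are hypothetical.

```python
import torch

def prefetch_params(cpu_params, device, copy_stream):
    # Issue asynchronous host-to-device copies on a dedicated stream so they
    # can overlap with computation running on the default stream.
    gpu_params = []
    with torch.cuda.stream(copy_stream):
        for p in cpu_params:
            gpu_params.append(p.to(device, non_blocking=True))
    return gpu_params

def run_stage(compute_fn, inputs, cpu_params_next, device):
    copy_stream = torch.cuda.Stream(device=device)
    # Start swapping in the next stage's parameters while this stage computes.
    next_params = prefetch_params(cpu_params_next, device, copy_stream)
    out = compute_fn(inputs)  # computation fills the would-be idle interval
    # Ensure the copies have finished before the fetched parameters are used.
    torch.cuda.current_stream(device).wait_stream(copy_stream)
    return out, next_params

if __name__ == "__main__":
    dev = torch.device("cuda")
    # Pinned host memory is needed for truly asynchronous H2D copies.
    cpu_params_next = [torch.randn(1024, 1024).pin_memory() for _ in range(4)]
    x = torch.randn(1024, 1024, device=dev)
    out, fetched = run_stage(lambda t: t @ t, x, cpu_params_next, dev)
    torch.cuda.synchronize()
    print(out.shape, len(fetched))
```

In this sketch, the dedicated copy stream plus pinned host buffers are what let the host-to-device transfer proceed concurrently with the matrix multiply on the default stream, which is the same computation/transfer overlap the abstract attributes its throughput gains to.
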
Appears in Collections: 資訊工程學系

Files in This Item:
File | Size | Format
ntu-113-1.pdf (restricted access) | 1.13 MB | Adobe PDF


Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
