Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/84656
Full metadata record (DC field: value [language code]):
dc.contributor.advisor: 楊佳玲 (Chia-Lin Yang)
dc.contributor.author: Shao-Fu Lin [en]
dc.contributor.author: 林芍甫 [zh_TW]
dc.date.accessioned: 2023-03-19T22:19:18Z
dc.date.copyright: 2022-09-16
dc.date.issued: 2022
dc.date.submitted: 2022-09-14
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/84656
dc.description.abstract: As deep neural network (DNN) models grow ever deeper and wider, overcoming the limited GPU memory capacity has become one of the main challenges in training large-scale neural networks. A common solution to this challenge is to use host memory as external memory and swap tensors in and out of GPU memory (tensor swapping). However, in a data-parallel training system the effectiveness of the tensor-swapping mechanism can be impaired by PCIe channel contention. In this thesis, we propose the first large-model training framework that coordinates tensor swapping among GPUs and thereby mitigates PCIe channel contention. We design two types of coordination mechanisms. The first synchronizes execution across GPUs to avoid issuing tensor-swapping commands simultaneously. The second selects disjoint tensors for each GPU so that transfers over the shared PCIe channel can be staggered. The effectiveness of these approaches depends on how often the GPUs need to synchronize gradients. Experimental results show that, compared with large-model training that ignores channel contention, the proposed solution achieves a 15% speedup on average. [zh_TW]
dc.description.abstract: As deep neural network (DNN) models grow deeper and wider, overcoming the limited GPU memory capacity has become one of the main challenges in training large-scale neural networks. A commonly used solution is to use host memory as external memory and swap tensors in and out of GPU memory. However, the effectiveness of tensor swapping can be impaired in a data-parallel training system because of contention on the shared PCIe channel to the host. In this paper, we propose the first large-model-support framework that coordinates tensor movements among GPUs so that PCIe channel contention is alleviated. We design two types of coordination mechanisms. The first synchronizes thread execution across GPUs to avoid issuing tensor-swapping commands at the same time. The second interleaves accesses to the shared PCIe channel by selecting disjoint sets of swapped-out tensors for each GPU. The effectiveness of these two methods depends on how often the GPUs need to synchronize gradients. Experimental results show that, compared with large model support that is oblivious to channel contention, the proposed solution achieves a 15% speedup on average. [en]
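The thesis's implementation is not included on this record page; the following is a minimal sketch, assuming PyTorch with CUDA, of the two ideas the abstract describes: swapping tensors to pinned host memory on a dedicated CUDA stream so copies overlap with compute, and assigning each data-parallel rank a disjoint subset of swap candidates so GPUs sharing a PCIe channel do not issue the same transfers at once. The names `swap_stream`, `offload`, `prefetch`, and `disjoint_swap_set` are illustrative and are not the thesis's API.

```python
# Minimal sketch (not the thesis's framework) of tensor swapping on a side
# CUDA stream plus disjoint swap-candidate selection per data-parallel rank.
import torch

swap_stream = torch.cuda.Stream()  # D2H/H2D copies run here, off the compute stream


def offload(gpu_tensor: torch.Tensor) -> torch.Tensor:
    """Asynchronously copy a GPU tensor into pinned host memory."""
    host_buf = torch.empty(gpu_tensor.shape, dtype=gpu_tensor.dtype,
                           device="cpu", pin_memory=True)
    # Make sure the kernels producing this tensor on the compute stream have finished.
    swap_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(swap_stream):
        host_buf.copy_(gpu_tensor, non_blocking=True)
    # Keep the GPU source buffer alive until the copy on swap_stream completes.
    gpu_tensor.record_stream(swap_stream)
    return host_buf


def prefetch(host_buf: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    """Asynchronously copy a pinned host tensor back to GPU memory.
    The compute stream must wait on swap_stream before using the result."""
    with torch.cuda.stream(swap_stream):
        return host_buf.to(device, non_blocking=True)


def disjoint_swap_set(candidates, rank: int, world_size: int):
    """Round-robin partition of swap candidates so each data-parallel rank
    offloads a disjoint subset ('disjoint swapping' in the abstract)."""
    return [t for i, t in enumerate(candidates) if i % world_size == rank]
```

In the framework the abstract describes, the actual swap schedule comes from the online heuristic or the offline MIP formulation listed in the table of contents below; this sketch only shows the mechanics of overlapping transfers with compute and of splitting swap work across ranks.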
dc.description.provenance: Made available in DSpace on 2023-03-19T22:19:18Z (GMT). No. of bitstreams: 1. U0001-0308202200204800.pdf: 1940221 bytes, checksum: db2791c6ed92d1bc81e5db3c2ac22af3 (MD5). Previous issue date: 2022. [en]
dc.description.tableofcontents:
摘要 (Abstract in Chinese)
Abstract
Contents
List of Figures
List of Tables
Chapter 1 Introduction
Chapter 2 Background
  2.1 DNN Models and DNN Training
  2.2 Tensor Swapping for Large Model Support
  2.3 Swapping with CUDA Asynchronous Semantics
  2.4 Distributed Model Training
  2.5 Interconnection of Modern GPU system
Chapter 3 Tensor Movement Orchestration Framework
  3.1 Design Overview
  3.2 Online/Offline Decision Engine
    3.2.1 Online Heuristic Policy
    3.2.2 Offline MIP Formulation
  3.3 Channel Contention Avoidance
    3.3.1 Coarse-Grain Interleaving - Bidirectional Overlapping
    3.3.2 Fine-Grain Interleaving - Disjoint Swapping
      3.3.2.1 Online Heuristic Policy
      3.3.2.2 Offline MIP Formulation
Chapter 4 Evaluation
  4.1 Experimental Setup
  4.2 Offline Decision Engine in Single-GPU System
  4.3 Contention Avoidance in Multi-GPU System
    4.3.1 Bidirectional Overlapping
    4.3.2 Disjoint Swapping
    4.3.3 Bidirectional Overlapping vs. Disjoint Swapping
Chapter 5 Related Work
  5.1 Single-GPU Large Model Training
  5.2 Distributed Model Training
  5.3 Training Large Model in Data-Parallel Multi-GPU system
Chapter 6 Future Work
  6.1 Deferring Collective Communication
Chapter 7 Conclusion
References
dc.language.iso: en
dc.title: 多GPU系統的大型神經網路訓練 [zh_TW]
dc.title: Large NN Model Support in Multi-GPU System [en]
dc.type: Thesis
dc.date.schoolyear: 110-2
dc.description.degree: 碩士 (Master's)
dc.contributor.author-orcid: 0000-0002-7352-0870
dc.contributor.oralexamcommittee: 陳依蓉 (Yi-jung Chen), 葉宗泰 (Tsung-Tai Yeh), 鄭湘筠 (Hsiang-Yun Cheng)
dc.subject.keyword: 大型模型訓練, GPU, 資料平行 [zh_TW]
dc.subject.keyword: Large Model Training, GPU, Data Parallelism [en]
dc.relation.page: 65
dc.identifier.doi: 10.6342/NTU202201995
dc.rights.note: Authorization granted (access restricted to campus)
dc.date.accepted: 2022-09-15
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) [zh_TW]
dc.contributor.author-dept: 資訊網路與多媒體研究所 (Graduate Institute of Networking and Multimedia) [zh_TW]
dc.date.embargo-lift: 2022-09-16
Appears in collections: 資訊網路與多媒體研究所 (Graduate Institute of Networking and Multimedia)

Files in this item:
File: U0001-0308202200204800.pdf (access restricted to NTU campus IP addresses; off-campus users should use the library's VPN service)
Size: 1.89 MB
Format: Adobe PDF