NTU Theses and Dissertations Repository
Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/84656
Title: Large NN Model Support in Multi-GPU System (多GPU系統的大型神經網路訓練)
Authors: Shao-Fu Lin (林芍甫)
Advisor: Chia-Lin Yang (楊佳玲)
Keywords: Large Model Training, GPU, Data Parallelism
Publication Year: 2022
Degree: Master's
Abstract: As deep neural network (DNN) models grow deeper and wider, overcoming the limited GPU memory capacity becomes one of the main challenges in training large-scale neural networks. A commonly used solution is to utilize host memory as external memory and swap tensors in and out of GPU memory. However, the effectiveness of tensor swapping can be impaired in a data-parallel training system due to contention on the shared PCIe channel to the host. In this paper, we propose the first large model support framework that coordinates tensor movements among GPUs so that PCIe channel contention is alleviated. We design two types of coordination mechanisms. The first synchronizes execution across GPUs to avoid issuing tensor swapping commands at the same time. The second interleaves accesses to the shared PCIe channel by selecting disjoint sets of swapped-out tensors for each GPU. The effectiveness of these two methods depends on how often the GPUs need to synchronize on gradients. Experimental results show that, compared to large model support that is oblivious to channel contention, the proposed solution achieves a 15% speedup on average.
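To make the two coordination mechanisms concrete, the sketch below illustrates the general idea in plain Python: swap-candidate tensors are partitioned into disjoint sets, one per GPU, and the resulting swap commands are interleaved round-robin so that transfers on the shared PCIe channel do not collide. This is only a minimal illustration under stated assumptions, not the framework proposed in the thesis; the names (Tensor, plan_swaps, issue_order) and the greedy balancing policy are hypothetical.

```python
# Minimal illustrative sketch (hypothetical names, not the thesis framework):
# partition swap-candidate tensors into disjoint per-GPU sets and interleave
# the resulting swap commands so transfers on the shared PCIe channel
# to the host do not overlap.

from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Tensor:
    name: str
    nbytes: int  # tensor size in bytes


def plan_swaps(candidates: List[Tensor], num_gpus: int) -> Dict[int, List[Tensor]]:
    """Assign each swap candidate to exactly one GPU (disjoint sets),
    greedily balancing the total bytes each GPU moves over the channel."""
    plan: Dict[int, List[Tensor]] = {g: [] for g in range(num_gpus)}
    load = [0] * num_gpus
    for t in sorted(candidates, key=lambda t: t.nbytes, reverse=True):
        g = load.index(min(load))  # least-loaded GPU so far
        plan[g].append(t)
        load[g] += t.nbytes
    return plan


def issue_order(plan: Dict[int, List[Tensor]]) -> List[Tuple[int, str]]:
    """Interleave swap commands round-robin across GPUs so that at any
    point only one GPU's transfer occupies the shared PCIe channel."""
    queues = {g: list(ts) for g, ts in plan.items()}
    order: List[Tuple[int, str]] = []
    i = 0
    while any(queues.values()):
        g = i % len(queues)
        if queues[g]:
            order.append((g, queues[g].pop(0).name))
        i += 1
    return order


if __name__ == "__main__":
    # Six activation tensors of 1..6 MiB, split across two GPUs.
    tensors = [Tensor(f"act{k}", nbytes=(k + 1) << 20) for k in range(6)]
    plan = plan_swaps(tensors, num_gpus=2)
    print(issue_order(plan))
```

In a real training system the transfers would be issued asynchronously on per-GPU streams; here the explicit ordering merely stands in for that scheduling.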
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/84656
DOI: 10.6342/NTU202201995
Fulltext Rights: Authorized (access restricted to the campus network)
Embargo Lift Date: 2022-09-16
Appears in Collections: Graduate Institute of Networking and Multimedia (資訊網路與多媒體研究所)

Files in This Item:
File: U0001-0308202200204800.pdf (1.89 MB, Adobe PDF; access limited to the NTU IP range)


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
