Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74121
Title: | An Energy-Efficient Accelerator SOC for Convolutional Neural Network Training |
Author: | Kung, Chu King (江子近) |
Advisor: | Tzi-Dar Chiueh (闕志達) |
Keywords: | FloatSD, Training, System on Chip (SOC), Convolutional Neural Network (CNN), Deep Neural Network (DNN) |
Publication Year: | 2019 |
Degree: | Master |
Abstract: | The recent resurgence of artificial intelligence is due to advances in deep learning. Deep neural networks (DNNs) have exceeded human capability in many computer vision applications, such as object detection, image classification, and playing games like Go. The idea of deep learning dates back to as early as the 1950s, with the key algorithmic breakthroughs occurring in the 1980s. Yet it is only in the past few years that hardware powerful enough to train neural networks has become available. Even now, the demand for machine learning algorithms is still increasing, and it is affecting almost every industry. Therefore, designing a powerful and efficient hardware accelerator for deep learning algorithms is of critical importance. Accelerators that run deep learning algorithms must be general enough to support deep neural networks with various computational structures. For instance, general-purpose graphics processing units (GP-GPUs) were widely adopted for deep learning tasks because they allow users to execute arbitrary code on them. Beyond graphics processing units, researchers have also paid a lot of attention to hardware acceleration of DNNs in the last few years. 
Google developed its own chip, the Tensor Processing Unit (TPU), to power its machine learning services [8], while Intel unveiled Nervana, its first generation of ASIC processors for deep learning, a few years ago [9]. ASICs usually deliver better performance than FPGA and software implementations, typically more than 10x the efficiency of GP-GPUs. Nevertheless, existing accelerators focus mostly on inference. However, local DNN training is still required to meet the needs of new applications, such as incremental learning and on-device personalization. Unlike inference, training requires a high dynamic range in order to deliver high learning quality. In this work, we introduce the floating-point signed digit (FloatSD) data representation format for reducing the computational complexity of both the inference and the training of convolutional neural networks (CNNs). By co-designing the data representation and the circuit, we demonstrate that we can achieve high raw performance and high efficiency in both energy and area without sacrificing training quality. This work focuses on the design of a FloatSD-based system on chip (SOC) for AI training and inference. The SOC consists of an AI IP, an integrated DDR3 controller, and an ARC HS34 CPU, interconnected through standard AXI/AHB AMBA interfaces. The platform can be programmed by the CPU via the AHB slave port to fit various neural network topologies. The completed SOC was tested and validated on the HAPS-80 FPGA platform. After the correctness of the SOC was verified, a standard synthesis and automated place and route (APR) flow was used to tape out a 28 nm test chip. At its normal operating condition (400 MHz), the accelerator achieves 1.38 TFLOPS peak performance and 2.34 TFLOPS/W energy efficiency. |
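The complexity reduction that signed-digit formats such as FloatSD target can be illustrated with a small sketch. The code below is a hypothetical Python illustration of canonical signed-digit (CSD) recoding, not the exact FloatSD format defined in the thesis: when a weight has only a few nonzero signed digits, multiplying by it reduces to a handful of shifts and adds, which is far cheaper in hardware than a full multiplier.

```python
# Hypothetical sketch of signed-digit weight encoding (CSD recoding).
# This illustrates the general idea behind formats like FloatSD; the
# thesis's actual FloatSD format is not reproduced here.

def to_csd(value: int) -> list[tuple[int, int]]:
    """Encode an integer as a list of (sign, bit_position) digits so that
    value == sum(sign * 2**pos for sign, pos in digits), with no two
    adjacent nonzero digits (canonical signed-digit form)."""
    digits = []
    v, pos = value, 0
    while v != 0:
        if v % 2:                 # odd: emit a +1 or -1 digit
            d = 2 - (v % 4)       # +1 if v % 4 == 1, -1 if v % 4 == 3
            digits.append((d, pos))
            v -= d
        v //= 2
        pos += 1
    return digits

def csd_multiply(x: int, digits: list[tuple[int, int]]) -> int:
    """Multiply x by the encoded weight using only shifts and adds,
    the hardware-friendly property signed-digit formats exploit."""
    return sum(sign * (x << pos) for sign, pos in digits)

w = 57                            # binary 111001 needs four 1-bits,
d = to_csd(w)                     # but CSD needs only three digits:
                                  # +2^0 - 2^3 + 2^6
assert csd_multiply(3, d) == 3 * w
```

In a real accelerator the analogous recoding is applied to the weight's mantissa, so each multiply-accumulate in convolution becomes a short sequence of shift-add operations.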
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74121 |
DOI: | 10.6342/NTU201903036 |
Full-Text Access: | Licensed for a fee |
Appears in Collections: | Graduate Institute of Electronics Engineering |
Files in this item:
File | Size | Format
---|---|---
ntu-108-1.pdf (currently not authorized for public access) | 13.54 MB | Adobe PDF
Items in this system are protected by copyright, with all rights reserved, unless otherwise indicated.