Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/68072
Title: | A Multiplier-less Convolution Neural Network Inference Acceleration Engine Based on FPGA |
Author: | Ming-Hang Hsieh (謝明航) |
Advisor: | Tzi-Dar Chiueh (闕志達) |
Keywords: | machine learning, convolutional neural network, inference acceleration system, FloatSD, FloatSD4, multiplier-less, half-precision accumulation |
Publication Year: | 2020 |
Degree: | Master |
Abstract: | Since the publication of AlexNet [1] in 2012, the applications of machine learning have grown ever more extensive: early applications such as image classification and object recognition, later ones such as style transfer [2] and natural language processing [3], and more recently video and audio generation [4][5]. Machine learning has demonstrated its potential and applicability in a wide range of fields. Most of these applications share one feature, namely the use of convolutional neural networks (CNNs). CNNs have become an indispensable part of machine learning, and the need to raise their execution speed has grown accordingly. Whether for cloud computing or edge computing, accelerating neural network inference with low power consumption and high efficiency has been a major research focus in recent years.

CNN inference requires a large number of multiply-add operations, and within a single layer these operations have no mathematical dependence on one another. Even with vector instruction sets, a conventional CPU therefore struggles to execute them efficiently. General-purpose computing on graphics processing units (GPGPU) addresses this problem well. However, owing to its development history and general-purpose nature, a GPGPU can parallelize convolution operations but cannot fully exploit the data-reuse characteristics unique to CNNs: all operations must first be converted and rearranged before the GPGPU's matrix units can accelerate them. These extra steps make GPGPU execution less efficient, and its high energy consumption makes it impractical for edge devices.

Based on the Floating-point Signed Digit (FloatSD) algorithm [6], this thesis proposes a more compact 4-bit FloatSD4 weight encoding. Besides greatly reducing weight transmission and storage, it simplifies the convolution operation from multiply-and-add to simple addition, significantly lowering computational complexity. On three image recognition data sets (MNIST, CIFAR-10, and ImageNet), CNNs trained with FloatSD4 weights achieve results close to their FP32 counterparts on MNIST and CIFAR-10, and on ImageNet both the top-1 and top-5 accuracies are within 0.5% of the FP32 results.

In addition to training CNNs with FloatSD4 weights, the other focus of this thesis is the inference acceleration circuit designed for FloatSD4 CNNs. Beyond the core acceleration processing elements, the proposed system is built on an FPGA and PC platform, with the FloatSD4 convolution processing elements, associated caches, and access circuitry implemented in the FPGA. VGG-7 on CIFAR-10 is used to verify the feasibility of the inference acceleration system. Compared with a CPU platform running single-precision arithmetic, the FPGA-based acceleration system is 4.82 times faster, and its overall energy efficiency is 80 times that of the CPU. |
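The core idea behind a multiplier-less convolution engine, constraining each weight to a small number of signed power-of-two digits so that every multiply-accumulate becomes shifts and adds, can be sketched as follows. This is a minimal hypothetical illustration of the general signed-digit principle: the names `quantize_spt` and `mac_spt` are invented here, a weight is reduced to a single signed power of two for simplicity, and the actual FloatSD4 format described in the thesis uses its own 4-bit encoding.

```python
import math

def quantize_spt(w: float) -> tuple[int, int]:
    """Quantize w to sign * 2**exp (nearest power of two).

    Returns (sign, exp), with sign in {-1, 0, +1}.
    """
    if w == 0:
        return 0, 0
    sign = 1 if w > 0 else -1
    exp = round(math.log2(abs(w)))
    return sign, exp

def mac_spt(acc: int, x: int, sign: int, exp: int) -> int:
    """Multiplier-less MAC: acc += x * sign * 2**exp.

    The multiplication is replaced by a barrel shift plus an
    add/subtract, which is what makes the hardware datapath cheap.
    """
    if sign == 0:
        return acc  # zero weight contributes nothing
    shifted = x << exp if exp >= 0 else x >> -exp
    return acc + shifted if sign > 0 else acc - shifted
```

In hardware, `mac_spt` corresponds to a shifter feeding an adder, so a convolution processing element needs no multiplier array; this is the source of the area and power savings the thesis reports.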
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/68072 |
DOI: | 10.6342/NTU202003812 |
Full-text License: | Paid authorization |
Appears in Collections: | Graduate Institute of Electronics Engineering |
Files in This Item:
File | Size | Format | |
---|---|---|---|
U0001-1708202017074300.pdf (currently not authorized for public access) | 5.51 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.