Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74577
Title: Reconfigurable Low Arithmetic Precision Convolution Neural Network Accelerator VLSI Design and Implementation
Author: En-Ho Shen (沈恩禾)
Advisor: Shao-Yi Chien (簡韶逸)
Keywords: Reconfigurable, Low Arithmetic Precision, Convolution Neural Network, Accelerator, VLSI, CNN, Quantization
Publication Year: 2019
Degree: Master's
Abstract: Deep neural networks (DNNs) show promising results on various AI application tasks. However, such networks are typically executed on general-purpose GPUs that are bulky in form factor and consume hundreds of watts, making them unsuitable for mobile applications. In this thesis, we present a VLSI architecture that processes quantized, low numeric-precision convolutional neural networks (CNNs), cutting down the power consumed by memory access and speeding up the model under a limited area budget, which makes it particularly fit for mobile devices. We first propose a quantization re-training algorithm for training low-precision CNNs, then a dataflow with a high data-reuse rate and a multiplication-accumulation strategy specially designed for such quantized models. To fully exploit the efficiency of computing with low-precision data, we design a micro-architecture for low bit-length multiplication and accumulation; an on-chip memory hierarchy and data re-alignment flow that save power and avoid buffer bank conflicts; and a PE array that takes data broadcast from the buffer and sends finished results sequentially back to the global buffer. The architecture is highly flexible for various CNN shapes and reconfigurable for low bit-length quantized models.
The design is synthesized with 180 KB of on-chip memory capacity and a 1340K logic-gate area; the implementation result shows state-of-the-art hardware efficiency.
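The quantized low-precision CNNs the abstract refers to rest on mapping floating-point weights onto a small signed-integer grid before inference. The following is a minimal sketch of generic symmetric uniform quantization, for illustration only; the function name `quantize_linear` and the 4-bit setting are assumptions, not the thesis's specific re-training algorithm.

```python
import numpy as np

def quantize_linear(w, bits):
    """Uniformly quantize an array to signed `bits`-bit integers (symmetric).

    Generic illustration of low-precision quantization, not the thesis's
    exact method: the largest magnitude in `w` is mapped to the integer
    extreme, and every value is rounded to the nearest grid point.
    """
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for 4-bit signed
    scale = np.abs(w).max() / qmax        # step size of the integer grid
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

# Dequantizing recovers an approximation: w ≈ q * scale
w = np.array([0.9, -0.45, 0.3, 0.05], dtype=np.float32)
q, s = quantize_linear(w, bits=4)
```

With 4-bit weights, each value needs only a 4-bit multiplier operand instead of a 32-bit float, which is what makes the low bit-length multiply-accumulate tree described in the abstract so much cheaper in area and power.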
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74577
DOI: 10.6342/NTU201902618
Full-Text Authorization: Paid access
Appears in Collections: Graduate Institute of Electronics Engineering
Files in This Item:
File | Size | Format
---|---|---
ntu-108-1.pdf (currently not authorized for public access) | 16.96 MB | Adobe PDF
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated in their copyright terms.