Please use this identifier to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/72627
Title: | Energy-Efficient Neural Network Accelerator with Vector Quantization (基於向量量化之節能神經網路加速器架構設計) |
Authors: | Yi-Min Chih 池翊忞 |
Advisor: | Shao-Yi Chien (簡韶逸) |
Keyword: | vector quantization, linear quantization, neural network, accelerator, hardware architecture |
Publication Year : | 2020 |
Degree: | Master's |
Abstract: | For the implementation of neural network inference systems, linear quantization (LQ) is a common model-compression technique and is widely employed in VLSI hardware architecture design. Quantization-aware training (QAT) further mitigates the accuracy degradation of the quantized model through end-to-end fine-tuning. On the other hand, vector quantization (VQ), a multi-dimensional non-linear quantization method, is rarely discussed in the literature. An efficient inference scheme with VQ has been proposed, but its accuracy drop is significant because no end-to-end fine-tuning technique exists for VQ. In this thesis, we propose a vector-quantization-aware training (VQAT) technique that can fine-tune the model end-to-end with any VQ parameters. In addition, we combine vanilla LQ to compress the model further, which means that any improved LQ method can also be combined with the proposed VQAT+LQ scheme to achieve better results. We also design a hardware architecture that efficiently executes vector-quantized neural networks with high performance and low latency on convolution (CONV), depthwise convolution (DW), and fully connected (FC) layers. Experimental results show that the VQAT+LQ scheme compresses the model by 1.16x to 1.18x compared with Quantized CNN (QCNN) while still achieving a smaller accuracy drop than QCNN. Moreover, in our proposed hardware architecture, VQAT+LQ further reduces DRAM access by 1.1x to 1.3x compared with VQAT alone. |
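The abstract's VQAT+LQ idea can be illustrated with a minimal sketch, assuming a generic codebook-based scheme: weights are split into d-dimensional sub-vectors, a small codebook is learned with k-means, each sub-vector is stored as a codeword index, and the codebook itself is then linearly quantized. This is not the thesis's actual implementation; all function names and parameters here are hypothetical.

```python
# Illustrative sketch of vector quantization (VQ) followed by linear
# quantization (LQ) of the codebook. Hypothetical helper code, not the
# thesis's implementation; no training (VQAT) step is shown.
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means over a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    for _ in range(iters):
        # Assign every vector to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(v, centroids[c])))
            clusters[j].append(v)
        # Recompute each centroid as the mean of its cluster.
        for j, cl in enumerate(clusters):
            if cl:
                centroids[j] = tuple(sum(x) / len(cl) for x in zip(*cl))
    return centroids

def vq_encode(weights, d=2, k=4):
    """Split flat weights into d-dim sub-vectors; encode as codebook indices."""
    subs = [tuple(weights[i:i + d]) for i in range(0, len(weights), d)]
    codebook = kmeans(subs, k)
    indices = [min(range(k),
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(s, codebook[c])))
               for s in subs]
    return codebook, indices

def vq_decode(codebook, indices):
    """Reconstruct the flat weight list from codebook and indices."""
    out = []
    for i in indices:
        out.extend(codebook[i])
    return out

def linear_quantize(codebook, bits=8):
    """Uniform (vanilla) LQ of the codebook entries, as in the VQAT+LQ idea."""
    flat = [x for v in codebook for x in v]
    lo, hi = min(flat), max(flat)
    scale = (hi - lo) / (2 ** bits - 1) or 1.0  # avoid zero scale
    q = [tuple(round((x - lo) / scale) for x in v) for v in codebook]
    dq = [tuple(i * scale + lo for i in v) for v in q]
    return q, dq
```

With this scheme, the stored model is the (linearly quantized) codebook plus one small index per sub-vector instead of d full-precision weights, which is where the compression over plain LQ comes from.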
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/72627 |
DOI: | 10.6342/NTU202100044 |
Fulltext Rights: | Paid authorization (有償授權) |
Appears in Collections: | Graduate Institute of Electronics Engineering |
Files in This Item:
File | Size | Format
---|---|---
U0001-1101202116382500.pdf (Restricted Access) | 6.01 MB | Adobe PDF
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.