Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74577
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 簡韶逸(Shao-Yi Chien) | |
dc.contributor.author | En-Ho Shen | en |
dc.contributor.author | 沈恩禾 | zh_TW |
dc.date.accessioned | 2021-06-17T08:43:42Z | - |
dc.date.available | 2019-08-18 | |
dc.date.copyright | 2019-08-18 | |
dc.date.issued | 2019 | |
dc.date.submitted | 2019-08-07 | |
dc.identifier.citation | [1] E. H. Lee, D. Miyashita, E. Chai, B. Murmann, and S. S. Wong, "LogNet: Energy-efficient neural networks using logarithmic computation," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Mar. 2017, pp. 5900–5904.
[2] S. Han, H. Mao, and W. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," Oct. 2016.
[3] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," arXiv e-prints, arXiv:1603.05279, Mar. 2016.
[4] S. Migacz, "8-bit inference with TensorRT." [Online]. Available: http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf
[5] Y. Chen, T. Krishna, J. Emer, and V. Sze, "14.5 Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," in 2016 IEEE International Solid-State Circuits Conference (ISSCC), Jan. 2016, pp. 262–263. [Online]. Available: http://people.csail.mit.edu/emer/slides/2016.02.isscc.eyeriss.slides.pdf
[6] B. Moons and M. Verhelst, "A 0.3–2.6 TOPS/W precision-scalable processor for real-time large-scale ConvNets," CoRR, vol. abs/1606.05094, 2016. [Online]. Available: http://arxiv.org/abs/1606.05094
[7] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, "14.5 Envision: A 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28nm FDSOI," in 2017 IEEE International Solid-State Circuits Conference (ISSCC), Feb. 2017, pp. 246–247.
[8] H. Sharma, J. Park, N. Suda, L. Lai, B. Chau, J. K. Kim, V. Chandra, and H. Esmaeilzadeh, "Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural networks," CoRR, vol. abs/1712.01507, 2017. [Online]. Available: http://arxiv.org/abs/1712.01507
[9] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv e-prints, arXiv:1512.03385, Dec. 2015.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105. [Online]. Available: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
[11] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, "EIE: Efficient inference engine on compressed deep neural network," arXiv e-prints, arXiv:1602.01528, Feb. 2016.
[12] Y. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), June 2016, pp. 367–379.
[13] J. Luo, J. Wu, and W. Lin, "ThiNet: A filter level pruning method for deep neural network compression," CoRR, vol. abs/1707.06342, 2017. [Online]. Available: http://arxiv.org/abs/1707.06342
[14] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," CoRR, vol. abs/1704.04861, 2017. [Online]. Available: http://arxiv.org/abs/1704.04861
[15] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," CoRR, vol. abs/1707.01083, 2017. [Online]. Available: http://arxiv.org/abs/1707.01083
[16] S. Anwar, K. Hwang, and W. Sung, "Fixed point optimization of deep convolutional neural networks for object recognition," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 1131–1135.
[17] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li, "ImageNet large scale visual recognition challenge," CoRR, vol. abs/1409.0575, 2014. [Online]. Available: http://arxiv.org/abs/1409.0575
[18] Y. LeCun and C. Cortes, "MNIST handwritten digit database," 2010. [Online]. Available: http://yann.lecun.com/exdb/mnist/
[19] F. Li and B. Liu, "Ternary weight networks," CoRR, vol. abs/1605.04711, 2016. [Online]. Available: http://arxiv.org/abs/1605.04711
[20] M. Courbariaux and Y. Bengio, "BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1," CoRR, vol. abs/1602.02830, 2016. [Online]. Available: http://arxiv.org/abs/1602.02830
[21] A. Krizhevsky, V. Nair, and G. Hinton, "CIFAR-10 (Canadian Institute for Advanced Research)." [Online]. Available: http://www.cs.toronto.edu/~kriz/cifar.html
[22] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," CoRR, vol. abs/1502.03167, 2015. [Online]. Available: http://arxiv.org/abs/1502.03167
[23] johnjohnlin, "Nicotb, a Python-Verilog co-simulation framework." [Online]. Available: https://github.com/johnjohnlin/nicotb | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74577 | - |
dc.description.abstract | In recent years, deep neural networks (DNNs) have achieved broad success and progress across AI applications. Such models, however, typically rely on bulky, power-hungry general-purpose GPUs (GPGPUs) for computation, which makes them unsuitable for battery-powered equipment such as mobile devices. In this thesis, we propose a VLSI design dedicated to computing quantized, low arithmetic-precision convolutional neural networks (CNNs); it greatly reduces the energy consumed by cross-system data transfer and is especially suited to neural network acceleration on mobile devices. We first propose a simple and effective network quantization algorithm, along with a dataflow strategy that achieves a high data-reuse rate and fits such quantized networks. To realize the full potential of the quantized data, we design a multiply-accumulate tree structure dedicated to low-precision operands, then propose an on-chip buffer hierarchy and data re-arrangement method that remove unnecessary data-access waste and memory bank conflicts, and finally a core processing element array that accepts data broadcast from the buffer to each compute unit and writes completed results back to the global buffer in order. The proposed architecture supports the great majority of CNN structures and can be reconfigured to the appropriate arithmetic precision to accommodate various quantized network models. The final design uses 180 KB of on-chip memory and 1340K logic gates. | zh_TW |
dc.description.abstract | Deep neural networks (DNNs) show promising results on various AI application tasks. However, such networks are typically executed on general-purpose GPUs that are bulky in form factor and draw hundreds of watts, which makes them unsuitable for mobile applications. In this thesis, we present a VLSI architecture that processes quantized, low numeric-precision convolutional neural networks (CNNs), cutting the power consumed by memory access and speeding up the model within a limited area budget, making it particularly fit for mobile devices. We first propose a quantization re-training algorithm for training low-precision CNNs, then a dataflow with a high data-reuse rate and a multiplication-accumulation strategy specially designed for such quantized models (both ideas are sketched just below this record). To fully exploit the efficiency of computing with such low-precision data, we design a micro-architecture for low bit-length multiplication and accumulation, an on-chip memory hierarchy and data re-alignment flow that save power and avoid buffer bank conflicts, and a PE array that takes data broadcast from the buffer and sends finished results sequentially back to the buffer under this dataflow. The architecture is highly flexible across CNN shapes and re-configurable for low bit-length quantized models. The design synthesizes to a 180 KB on-chip memory capacity and an area of 1340k logic gates, and the implementation results show state-of-the-art hardware efficiency. | en |
dc.description.provenance | Made available in DSpace on 2021-06-17T08:43:42Z (GMT). No. of bitstreams: 1 ntu-108-R05943011-1.pdf: 17371880 bytes, checksum: 17625a0b89ce521d1fe05e93ed8fbcf0 (MD5) Previous issue date: 2019 | en |
dc.description.tableofcontents | Abstract
List of Figures
List of Tables
1 Introduction
 1.1 Motivation
 1.2 Contribution
 1.3 Thesis Outline
2 Related Work
 2.1 Quantization
  2.1.1 Fixed-point quantization
  2.1.2 Ternary to binary quantization
  2.1.3 8-bit quantization on modern models
 2.2 Hardware design
  2.2.1 Dataflow optimization: row stationary
  2.2.2 Precision-reconfigurable and sub-word parallelism arithmetic unit
  2.2.3 Bit-level re-configurable arithmetic unit
3 Low Numeric Precision Convolutional Neural Network
 3.1 Convolutional Neural Networks
 3.2 Low Precision CNN
 3.3 Quantization Loss Minimization Threshold Selection
 3.4 Computational consideration and data re-packing
  3.4.1 Data re-packing
4 Proposed Architecture
 4.1 System Overview
  4.1.1 Output row stationary dataflow
  4.1.2 Data tiling
  4.1.3 Data re-alignment and buffer hierarchy
 4.2 Architecture
  4.2.1 PE processing pipeline
  4.2.2 Sub-word accumulation operation and re-configurable arithmetic logic unit
  4.2.3 Shift dispatcher
  4.2.4 Quantization
5 Results
 5.1 Quantization error minimization training
 5.2 Implementation results
  5.2.1 Area and power
  5.2.2 Experiments
6 Conclusion
References | |
dc.language.iso | zh-TW | |
dc.title | 低數值精確度捲積神經網路加速器之可重置化超大型積體電路設計 | zh_TW |
dc.title | Reconfigurable Low Arithmetic Precision Convolution Neural Network Accelerator VLSI Design and Implementation | en |
dc.type | Thesis | |
dc.date.schoolyear | 107-2 | |
dc.description.degree | Master's | |
dc.contributor.oralexamcommittee | 蔡宗漢,吳安宇,楊家驤 | |
dc.subject.keyword | Low Numeric Precision, Convolutional Neural Network, Accelerator, Reconfigurable, VLSI Design | zh_TW |
dc.subject.keyword | Reconfigurable, Low Arithmetic Precision, Convolution Neural Network, Accelerator, VLSI, CNN, Quantization | en |
dc.relation.page | 58 | |
dc.identifier.doi | 10.6342/NTU201902618 | |
dc.rights.note | Paid authorization | |
dc.date.accepted | 2019-08-07 | |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | zh_TW |
dc.contributor.author-dept | Graduate Institute of Electronics Engineering | zh_TW |
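The abstracts above hinge on two techniques that a short sketch can make concrete. First, the quantization-loss-minimizing threshold selection (thesis section 3.3): this record does not give the exact objective, so the following is a minimal sketch assuming symmetric linear quantization and an L2 reconstruction-error criterion; the helper names `quantize` and `select_threshold` are illustrative, not from the thesis.

```python
import numpy as np

def quantize(x, threshold, bits):
    """Symmetric linear quantization of x to signed `bits`-bit levels,
    clipping magnitudes at `threshold`; returns de-quantized values."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit
    scale = threshold / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)   # integer code
    return q * scale                                # back to real values

def select_threshold(x, bits, num_candidates=100):
    """Scan clipping thresholds and keep the one minimizing the L2
    quantization error (the assumed loss; the thesis's may differ)."""
    max_abs = np.abs(x).max()
    candidates = np.linspace(max_abs / num_candidates, max_abs, num_candidates)
    errors = [np.sum((x - quantize(x, t, bits)) ** 2) for t in candidates]
    return candidates[int(np.argmin(errors))]

# Example: pick a 4-bit clipping threshold for a random weight tensor.
weights = np.random.randn(256, 128).astype(np.float32)
print("selected threshold:", select_threshold(weights, bits=4))
```

A smaller threshold spends the few quantization levels on the dense center of the weight distribution at the cost of clipping outliers; scanning makes that trade explicit, which is why a clipped threshold usually beats the naive max-abs choice at 4 bits and below.

Second, the sub-word-parallel accumulation of section 4.2.2 (in the spirit of [6], [7]): several low bit-length products are packed into one wide word so that a single wide addition advances all lanes at once. This toy sketch only illustrates the packing arithmetic; the lane count and width are assumptions, and the actual design is RTL that must budget guard bits so lanes never carry into one another.

```python
LANES = 4        # sub-words packed per wide word (assumed)
LANE_BITS = 16   # bits per lane: product bits plus guard bits for carries

def pack(values):
    """Pack LANES small non-negative ints into one wide int, one per lane."""
    word = 0
    for i, v in enumerate(values):
        assert 0 <= v < (1 << LANE_BITS), "value must fit in its lane"
        word |= v << (i * LANE_BITS)
    return word

def unpack(word):
    """Split a wide accumulator back into its LANES lane values."""
    mask = (1 << LANE_BITS) - 1
    return [(word >> (i * LANE_BITS)) & mask for i in range(LANES)]

# One weight per step is broadcast to 4 activation lanes (4 output pixels);
# each wide add then accumulates all 4 dot products simultaneously.
acts = [[3, 1, 2, 7], [2, 2, 2, 2], [5, 0, 1, 3]]   # 3 steps x 4 lanes
wts = [2, 3, 1]                                     # 3 weight taps
acc = 0
for step in range(3):
    acc += pack([wts[step] * a for a in acts[step]])
print(unpack(acc))  # per-lane dot products: [17, 8, 11, 23]
```

Packing four 16-bit lanes into one 64-bit accumulator trades per-lane dynamic range for throughput, which is exactly the knob a precision-reconfigurable arithmetic unit exposes: narrower operands mean more lanes per word.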
Appears in Collections: | Graduate Institute of Electronics Engineering
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-108-1.pdf (currently not authorized for public access) | 16.96 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.