Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74121
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 闕志達(Tzi-Dar Chiueh) | |
dc.contributor.author | Kung, Chu King | en |
dc.contributor.author | 江子近 | zh_TW |
dc.date.accessioned | 2021-06-17T08:20:46Z | - |
dc.date.available | 2020-08-20 | |
dc.date.copyright | 2019-08-20 | |
dc.date.issued | 2019 | |
dc.date.submitted | 2019-08-13 | |
dc.identifier.citation | [1] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, 'Learning representations by back-propagating errors,' Nature, vol. 323, no. 6088, pp. 533-536, 1986.
[2] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, 'Gradient-based learning applied to document recognition,' Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, 'ImageNet Classification with Deep Convolutional Neural Networks,' NIPS, 2012.
[4] C. Szegedy et al., 'Going deeper with convolutions,' in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1-9.
[5] K. He, X. Zhang, S. Ren, and J. Sun, 'Deep Residual Learning for Image Recognition,' in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778.
[6] K. Ovtcharov, O. Ruwase, J.-Y. Kim, J. Fowers, K. Strauss, and E. S. Chung, 'Accelerating Deep Convolutional Neural Networks Using Specialized Hardware,' 2015.
[7] S. Mittal, 'A Survey of FPGA-based Accelerators for Convolutional Neural Networks,' 2018.
[8] N. P. Jouppi et al., 'In-datacenter performance analysis of a tensor processing unit,' pp. 1-12, 2017.
[9] U. Köster et al., 'Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks,' 2017.
[10] S. Han et al., 'EIE: Efficient Inference Engine on Compressed Deep Neural Network,' in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016, pp. 243-254.
[11] Y. Chen, T. Krishna, J. S. Emer, and V. Sze, 'Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,' IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127-138, 2017.
[12] Z. Yuan et al., 'Sticker: A 0.41-62.1 TOPS/W 8Bit Neural Network Processor with Multi-Sparsity Compatible Convolution Arrays and Online Tuning Acceleration for Fully Connected Layers,' in 2018 IEEE Symposium on VLSI Circuits, 2018, pp. 33-34.
[13] J. Song et al., '7.1 An 11.5TOPS/W 1024-MAC Butterfly Structure Dual-Core Sparsity-Aware Neural Processing Unit in 8nm Flagship Mobile SoC,' in 2019 IEEE International Solid-State Circuits Conference (ISSCC), 2019, pp. 130-132.
[14] J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H. Yoo, 'UNPU: An Energy-Efficient Deep Neural Network Accelerator With Fully Variable Weight Bit Precision,' IEEE Journal of Solid-State Circuits, vol. 54, no. 1, pp. 173-185, 2019.
[15] S. Yin et al., 'An Ultra-High Energy-Efficient Reconfigurable Processor for Deep Neural Networks with Binary/Ternary Weights in 28NM CMOS,' in 2018 IEEE Symposium on VLSI Circuits, 2018, pp. 37-38.
[16] M. Anders et al., '2.9TOPS/W Reconfigurable Dense/Sparse Matrix-Multiply Accelerator with Unified INT8/INT16/FP16 Datapath in 14NM Tri-Gate CMOS,' in 2018 IEEE Symposium on VLSI Circuits, 2018, pp. 39-40.
[17] S. Zhang et al., 'Cambricon-X: An accelerator for sparse neural networks,' in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016, pp. 1-12.
[18] D. Shin, J. Lee, J. Lee, and H. Yoo, '14.2 DNPU: An 8.1TOPS/W reconfigurable CNN-RNN processor for general-purpose deep neural networks,' in 2017 IEEE International Solid-State Circuits Conference (ISSCC), 2017, pp. 240-241.
[19] Z. Yuan, Y. Liu, J. Yue, J. Li, and H. Yang, 'CORAL: Coarse-grained reconfigurable architecture for Convolutional Neural Networks,' in 2017 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), 2017, pp. 1-6.
[20] G. Venkatesh, E. Nurvitadhi, and D. Marr, 'Accelerating Deep Convolutional Networks using low-precision and sparsity,' in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 2861-2865.
[21] A. Parashar et al., 'SCNN: An accelerator for compressed-sparse convolutional neural networks,' in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), 2017, pp. 27-40.
[22] G. Desoli et al., '14.1 A 2.9TOPS/W deep convolutional neural network SoC in FD-SOI 28nm for intelligent embedded systems,' in 2017 IEEE International Solid-State Circuits Conference (ISSCC), 2017, pp. 238-239.
[23] S. Wang, D. Zhou, X. Han, and T. Yoshimura, 'Chain-NN: An energy-efficient 1D chain architecture for accelerating deep convolutional neural networks,' in Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017, pp. 1032-1037.
[24] Y. Shen, M. Ferdman, and P. Milder, 'Escher: A CNN Accelerator with Flexible Buffering to Minimize Off-Chip Transfer,' in 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2017, pp. 93-100.
[25] C. Wang, L. Gong, Q. Yu, X. Li, Y. Xie, and X. Zhou, 'DLAU: A Scalable Deep Learning Accelerator Unit on FPGA,' IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 36, no. 3, pp. 513-517, 2017.
[26] Y. Umuroglu et al., 'FINN: A Framework for Fast, Scalable Binarized Neural Network Inference,' arXiv:1612.07119, 2016.
[27] C. Zhang, F. Zhenman, Z. Peipei, P. Peichen, and C. Jason, 'Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks,' in 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2016, pp. 1-8.
[28] L. Huimin, F. Xitian, J. Li, C. Wei, Z. Xuegong, and W. Lingli, 'A high performance FPGA-based accelerator for large-scale convolutional neural networks,' in 2016 26th International Conference on Field Programmable Logic and Applications (FPL), 2016, pp. 1-9.
[29] S. Wijeratne, S. Jayaweera, M. Dananjaya, and A. Pasqual, 'Reconfigurable co-processor architecture with limited numerical precision to accelerate deep convolutional neural networks,' in 2018 IEEE 29th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2018, pp. 1-7.
[30] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, 'Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing,' in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016, pp. 1-13.
[31] P. Judd, A. Delmas, S. Sharify, and A. Moshovos, 'Cnvlutin2: Ineffectual-Activation-and-Weight-Free Deep Neural Network Computing,' arXiv:1705.00125, 2017.
[32] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and A. Moshovos, 'Stripes: Bit-serial deep neural network computing,' in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016, pp. 1-12.
[33] A. Erdem, C. Silvano, T. Boesch, A. Ornstein, S. Singh, and G. Desoli, 'Design Space Exploration for Orlando Ultra Low-Power Convolutional Neural Network SoC,' in 2018 IEEE 29th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2018, pp. 1-7.
[34] Y. Chen et al., 'DaDianNao: A Machine-Learning Supercomputer,' in 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014, pp. 609-622.
[35] S. Venkataramani et al., 'SCALEDEEP: A scalable compute architecture for learning and evaluating deep networks,' in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), 2017, pp. 13-26.
[36] J. Lee, J. Lee, D. Han, J. Lee, G. Park, and H. Yoo, '7.7 LNPU: A 25.3TFLOPS/W Sparse Deep-Neural-Network Learning Processor with Fine-Grained Mixed Precision of FP8-FP16,' in 2019 IEEE International Solid-State Circuits Conference (ISSCC), 2019, pp. 142-144.
[37] B. Fleischer et al., 'A Scalable Multi-TeraOPS Deep Learning Processor Core for AI Training and Inference,' in 2018 IEEE Symposium on VLSI Circuits, 2018, pp. 35-36.
[38] Z. Wenlai et al., 'F-CNN: An FPGA-based framework for training Convolutional Neural Networks,' in 2016 IEEE 27th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2016, pp. 107-114.
[39] X. Han, D. Zhou, S. Wang, and S. Kimura, 'CNN-MERP: An FPGA-based memory-efficient reconfigurable processor for forward and backward propagation of convolutional neural networks,' in 2016 IEEE 34th International Conference on Computer Design (ICCD), 2016, pp. 320-327.
[40] V. Sze, Y. Chen, T. Yang, and J. S. Emer, 'Efficient Processing of Deep Neural Networks: A Tutorial and Survey,' Proceedings of the IEEE, vol. 105, no. 12, pp. 2295-2329, 2017.
[41] P. Lin, M. Sun, C. Kung, and T. Chiueh, 'FloatSD: A New Weight Representation and Associated Update Method for Efficient Convolutional Neural Network Training,' IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 2, pp. 267-279, 2019.
[42] N. Wang, J. Choi, D. Brand, C.-Y. Chen, and K. Gopalakrishnan, 'Training Deep Neural Networks with 8-bit Floating Point Numbers,' Proc. Adv. Neural Inf. Process. Syst., pp. 7685-7694, 2018.
[43] P. Lin, 'Low-complexity Convolutional Neural Network Training and Low Power Circuit Design of its Processing Element,' Master's thesis, National Taiwan University, 2017.
[44] F. Rosenblatt, 'The Perceptron--a perceiving and recognizing automaton,' Report 85-460-1, Cornell Aeronautical Laboratory, 1957.
[45] M. A. Nielsen, Neural Networks and Deep Learning, Determination Press, 2015.
[46] T. H. Juang, 'Energy-Efficient Accelerator Architecture for Neural Network Training and its Circuit Design,' Master's thesis, National Taiwan University, 2018.
[47] Y. Jia et al., 'Caffe: Convolutional Architecture for Fast Feature Embedding,' Proc. ACM Int. Conf. Multimedia, pp. 675-678, 2014.
[48] TSRI, 'TSRI AI SoC設計平台簡介,' 7-Jul-2019. Available: https://www.tsri.org.tw/aisoc/aisoc.jsp
[49] ARM, AMBA AXI and ACE Protocol Specification: AXI3, AXI4, and AXI4-Lite and ACE-Lite, 2011.
[50] P. Elenius and L. Levine, 'Comparing Flip-Chip and Wire-Bond Interconnection Technologies,' Chip Scale Review, 2000.
[51] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016.
[52] P. Bright, 'Google brings 45 teraflops tensor flow processors to its compute cloud,' 2017. Available: https://arstechnica.com/information-technology/2017/05/google-brings-45-teraflops-tensor-flow-processors-to-its-compute-cloud/
[53] P. Clark, 'AI Chip Startup Shares Insights,' 2016. Available: https://www.eetimes.com/document.asp?doc_id=1330739
[54] 'Graphcore at NIPS 2017 Presentations,' 2017. Available: https://www.graphcore.ai/nips2017_presentations | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74121 | - |
dc.description.abstract | The recent resurgence of artificial intelligence is built on deep learning algorithms. Deep neural networks (DNNs) have surpassed human visual recognition ability in many computer vision applications, such as object detection, image classification, and game playing as in AlphaGo. The idea of deep learning dates back to the 1950s, and the key algorithmic breakthroughs occurred in the 1980s. Only in recent years, however, has hardware powerful enough to support neural network training become available.
Even today, machine learning continues to affect every industry and even our daily lives, so designing a powerful and efficient hardware accelerator for deep learning algorithms is of critical importance. An accelerator that runs deep learning algorithms must be flexible enough to support neural networks of various structures. GP-GPUs, for example, are widely used for neural network acceleration precisely because users can run arbitrary code on them. Beyond GP-GPUs, academia and industry have devoted considerable effort to hardware acceleration research over the past few years. In general, an ASIC is more than ten times as efficient as a GP-GPU. Existing accelerators, however, focus mainly on inference, while neural network training will gradually become a trend as the technology develops. Unlike inference, training requires a high dynamic range to guarantee good training results. In this thesis, we adopt the Floating-Point Signed Digit (FloatSD) number representation format to reduce the computational complexity of convolutional neural network (CNN) inference and training. By co-designing the number representation and the associated circuits, we aim to save power and area without compromising accuracy. The thesis focuses on the design of a FloatSD-based SOC for AI training and inference. The SOC consists of an AI IP, a DDR3 controller, and an ARC HS34 CPU, which communicate over an internal AXI/AHB AMBA bus fabric. The CPU can program the platform through an AHB slave port to support various neural network models. The completed SOC was tested and verified on the HAPS-80 FPGA platform, and a 28 nm chip was then taped out using a standard digital synthesis and automated place-and-route (APR) flow. Under its normal operating condition (400 MHz), the chip achieves a throughput of 1.38 TFLOPS and an efficiency of 2.34 TFLOPS/W. | zh_TW |
dc.description.abstract | The recent resurgence of artificial intelligence is due to advances in deep learning. Deep neural networks (DNNs) have exceeded human capability in many computer vision applications, such as object detection, image classification, and playing games like Go. The idea of deep learning dates back to as early as the 1950s, with the key algorithmic breakthroughs occurring in the 1980s. Yet it is only in the past few years that hardware accelerators powerful enough to train neural networks have become available.
Even now, the demand for machine learning algorithms is still increasing, and it is affecting almost every industry. Designing a powerful and efficient hardware accelerator for deep learning algorithms is therefore of critical importance. Accelerators that run deep learning algorithms must be general enough to support deep neural networks with various computational structures. For instance, general-purpose graphics processing units (GP-GPUs) have been widely adopted for deep learning tasks because they allow users to execute arbitrary code. Beyond GPUs, researchers have also paid a great deal of attention to hardware acceleration of DNNs in the last few years. Google developed its own chip, the Tensor Processing Unit (TPU), to power its machine learning services [8], while Intel unveiled its first-generation deep learning ASIC, called Nervana, a few years ago [9]. ASICs usually provide better performance than FPGA and software implementations. Nevertheless, existing accelerators mostly focus on inference, whereas local DNN training is still required to meet the needs of new applications such as incremental learning and on-device personalization. Unlike inference, training requires a high dynamic range in order to deliver high learning quality. In this work, we introduce the floating-point signed-digit (FloatSD) data representation format to reduce the computational complexity of both inference and training of convolutional neural networks (CNNs). By co-designing the data representation and the circuit, we demonstrate that high raw performance and excellent energy and area efficiency can be achieved without sacrificing training quality. This work focuses on the design of a FloatSD-based system on chip (SOC) for AI training and inference. The SOC integrates an AI IP, a DDR3 controller, and an ARC HS34 CPU through standard AXI/AHB AMBA interfaces. The platform can be programmed by the CPU via the AHB slave port to fit various neural network topologies. The completed SOC has been tested and validated on the HAPS-80 FPGA platform. After the correctness of the SOC was verified, a synthesis and automated place-and-route (APR) flow was used to tape out a 28 nm test chip. At its normal operating condition (400 MHz), the accelerator delivers 1.38 TFLOPS of peak performance at an efficiency of 2.34 TFLOPS/W. | en |
dc.description.provenance | Made available in DSpace on 2021-06-17T08:20:46Z (GMT). No. of bitstreams: 1 ntu-108-R06943159-1.pdf: 13869164 bytes, checksum: 4baa0103d1c441c878d37dca6a993492 (MD5) Previous issue date: 2019 | en |
dc.description.tableofcontents | Acknowledgements i
Abstract (Chinese) iii
Abstract v
Table of Contents vii
List of Tables x
List of Figures xi
Chapter 1 Introduction 1
1.1. Background 1
1.2. Motivation of the Thesis 3
1.3. Organization and Contributions of the Thesis 4
Chapter 2 Neural Networks 6
2.1. Multilayer Perceptron (MLP) 6
2.1.1. The Architecture of Multilayer Perceptron (MLP) 7
2.1.2. The Backpropagation Algorithm 9
2.2. Convolutional Neural Networks (CNNs) 11
2.2.1. The Architecture of Convolutional Neural Networks (CNNs) 11
2.2.2. The Backpropagation Algorithm 13
Chapter 3 Data Representation for Low Complexity Training and Inference 16
3.1. Signed Digit (SD) Representation 16
3.1.1. The Update Mechanism of Signed Digit (SD) Representation 18
3.2. FloatSD 19
3.3. Quantization 20
3.4. Software Simulations 22
3.4.1. Simulation Results 22
3.5. FloatSD Processing Element (PE) 26
3.5.1. Encoding and Decoding of 8-bit Weight 27
3.5.1.1. Multiplication 28
3.5.1.2. Conversion From Truncated Float to FloatSD 33
3.5.1.3. Adding all the Partial Products 35
3.5.2. Architecture 37
Chapter 4 The Architecture of FloatSD based AI IP 40
4.1. High-level Planning 40
4.1.1. An ARC HS34 based SoC Platform 41
4.1.2. Hardware-Firmware Partition 42
4.1.3. Data Allocation in the Main Memory (DRAM) 43
4.2. Overall Architecture of AI IP 44
4.3. Controllers 45
4.3.1. AHB Controller 46
4.3.2. Top Controller 49
4.3.3. Local Controller 52
4.4. Address Generation Units 54
4.4.1. Read Address Generation Unit 54
4.4.2. Write Address Generation Unit 56
4.4.3. Result Address Generation Unit 57
4.5. Access of SRAMs 59
4.5.1. Direct Memory Accesses (DMAs) 59
4.5.2. Local Switch 63
4.6. PE Cubes 66
Chapter 5 Verification of FloatSD based AI IP 71
5.1. Convolutional Layers 71
5.1.1. Forward Propagation 71
5.1.2. Backward Propagation 75
5.1.3. Computation of Gradients 77
5.2. Fully-connected Layers 78
5.2.1. Forward Propagation 79
5.2.2. Backward Propagation 82
5.2.3. Computation of Gradients 87
Chapter 6 Overall CNN Training Simulation and Validation by FPGA 89
6.1. Software Model 89
6.2. RTL Simulation 91
6.3. FPGA Validation 92
Chapter 7 SOC Design 96
7.1. Synthesis 96
7.1.1. Synthesis Results 96
7.1.2. Pre-layout Simulation 100
7.2. Automated Place and Route (APR) 102
7.2.1. Innovus Foundation Flow 102
7.2.2. Floorplan and Power-plan 103
7.3. Summary 115
Chapter 8 Conclusions and Future Work 118
8.1. Conclusions 118
8.2. Future Work 119
Bibliography 122 | |
dc.language.iso | en | |
dc.title | 高效能卷積神經網路訓練系統加速晶片 | zh_TW |
dc.title | An Energy-Efficient Accelerator SOC for Convolutional Neural Network Training | en |
dc.type | Thesis | |
dc.date.schoolyear | 107-2 | |
dc.description.degree | Master's | |
dc.contributor.oralexamcommittee | 楊佳玲(Chia-Lin Yang),蔡佩芸(Pei-Yun Tsai) | |
dc.subject.keyword | FloatSD, Training, SOC, Convolutional Neural Network, Deep Learning Network | zh_TW |
dc.subject.keyword | FloatSD, Training, System on Chip (SOC), Convolutional Neural Network (CNN), Deep Neural Network (DNN) | en |
dc.relation.page | 128 | |
dc.identifier.doi | 10.6342/NTU201903036 | |
dc.rights.note | Paid authorization | |
dc.date.accepted | 2019-08-14 | |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | zh_TW |
dc.contributor.author-dept | Graduate Institute of Electronics Engineering | zh_TW |
Appears in Collections: | Graduate Institute of Electronics Engineering
Files in This Item:
File | Size | Format |
---|---|---|---|
ntu-108-1.pdf (currently not authorized for public access) | 13.54 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.