Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/78852

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 楊家驤 | zh_TW |
| dc.contributor.advisor | Chia-Hsiang Yang | en |
| dc.contributor.author | 呂丞勛 | zh_TW |
| dc.contributor.author | Cheng-Hsun Lu | en |
| dc.date.accessioned | 2021-07-11T15:24:07Z | - |
| dc.date.available | 2024-01-01 | - |
| dc.date.copyright | 2019-01-24 | - |
| dc.date.issued | 2019 | - |
| dc.date.submitted | 2002-01-01 | - |
| dc.identifier.citation | [1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, May 2015. [2] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, pp. 533–536, Oct. 1986. [3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2009. [4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Advances in Neural Information Processing Systems (NIPS), pp. 1097–1105, 2012. [5] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” ICLR, 2015. [6] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 770–778, Jun. 2016. [7] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 1–9, 2015. [8] O. Abdel-Hamid, A.-R. Mohamed, H. Jiang, and G. Penn, “Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition,” Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), pp. 4277–4280, Mar. 2012. [9] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 3431–3440, Jun. 2015. [10] J. Donahue et al., “Long-term recurrent convolutional networks for visual recognition and description,” Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 2625–2634, Jun. 2015. [11] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, pp. 262–263, Feb. 2016. [12] B. Moons and M. Verhelst, “A 0.3–2.6 TOPS/W precision-scalable processor for real-time large-scale ConvNets,” IEEE Symp. VLSI Circuits, pp. 1–2, Jun. 2016. [13] B. Moons and M. Verhelst, “An Energy-Efficient Precision-Scalable ConvNet Processor in 40-nm CMOS,” IEEE Journal of Solid-State Circuits, vol. 52, no. 4, pp. 903–914, Apr. 2017. [14] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, “Envision: A 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28nm FDSOI,” IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, pp. 246–257, Feb. 2017. [15] S. Yin, P. Ouyang, S. Tang, F. Tu, X. Li, L. Liu, and S. Wei, “A 1.06-to-5.09 TOPS/W Reconfigurable Hybrid-Neural-Network Processor for Deep Learning Applications,” IEEE Symp. VLSI Circuits Dig. Tech. Papers, pp. 26–27, 2017. [16] J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H.-J. Yoo, “UNPU: A 50.6 TOPS/W unified deep neural network accelerator with 1b-to-16b fully-variable weight bit-precision,” IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, pp. 218–219, Feb. 2018. [17] S. Choi, J. Lee, K. Lee, and H.-J. Yoo, “A 9.02mW CNN-Stereo-Based Real-Time 3D Hand-Gesture Recognition Processor for Smart Mobile Devices,” IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, pp. 220–221, Feb. 2018. [18] D. Bankman, L. Yang, B. Moons, M. Verhelst, and B. Murmann, “An Always-On 3.8μJ/86% CIFAR-10 Mixed-Signal Binary CNN Processor with All Memory on Chip in 28nm CMOS,” IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, pp. 222–223, Feb. 2018. [19] S. Yin, P. Ouyang, J. Yang, T. Lu, X. Li, L. Liu, and S. Wei, “An ultra-high energy-efficient reconfigurable processor for deep neural networks with binary/ternary weights in 28nm CMOS,” IEEE Symp. VLSI Circuits Dig. Tech. Papers, pp. 37–38, 2018. [20] S. Yin, P. Ouyang, S. Zheng, D. Song, X. Li, L. Liu, and S. Wei, “A 141 uW, 2.46 pJ/Neuron Binarized Convolutional Neural Network based Self-learning Speech Recognition Processor in 28nm CMOS,” IEEE Symp. VLSI Circuits Dig. Tech. Papers, pp. 139–140, 2018. [21] D. Shin, J. Lee, J. Lee, and H.-J. Yoo, “DNPU: An 8.1 TOPS/W Reconfigurable CNN-RNN Processor for General-Purpose Deep Neural Networks,” IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, pp. 240–242, Feb. 2017. [22] Z. Yuan, J. Yue, H. Yang, Z. Wang, J. Li, Y. Yang, Q. Guo, X. Li, M.-F. Chang, H. Yang, and Y. Liu, “STICKER: A 0.41-62.1 TOPS/W 8bit Neural Network Processor with Multi-Sparsity Compatible Convolution Arrays and Online Tuning Acceleration for Fully Connected Layers,” IEEE Symp. VLSI Circuits Dig. Tech. Papers, pp. 139–140, 2018. [23] S. Bang, J. Wang, Z. Li, C. Gao, Y. Kim, Q. Dong, Y.-P. Chen, L. Fick, X. Sun, R. Dreslinski, T. Mudge, H.-S. Kim, D. Blaauw, and D. Sylvester, “A 288μW Programmable Deep-Learning Processor with 270KB On-Chip Weight Storage Using Non-Uniform Memory Hierarchy for Mobile Intelligence,” IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, pp. 250–252, Feb. 2017. [24] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: Efficient Inference Engine on Compressed Deep Neural Network,” Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit. (ISCA), pp. 267–278, Jun. 2016. [25] P. N. Whatmough, S. K. Lee, H. Lee, S. Rama, D. Brooks, and G.-Y. Wei, “A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine with >0.1 Timing Error Rate Tolerance for IoT Applications,” IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, pp. 242–243, Feb. 2017. [26] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?,” Advances in Neural Information Processing Systems (NIPS), pp. 3320–3328, 2014. [27] B. Fleischer et al., “A Scalable Multi-TeraOPS Deep Learning Processor Core for AI Training and Inference,” IEEE Symp. VLSI Circuits Dig. Tech. Papers, pp. 35–36, 2018. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/78852 | - |
| dc.description.abstract | 深度學習已廣泛使用於各種領域,並且在某些應用上已達到超越人類的性能。為了滿足對於運算能力的需求,目前已有許多客製化的深度神經網路推論加速器。主動學習機制對於安全與隱私保護有顯著的提升,特別是用於醫療照護及身份認證;針對使用者特徵進行神經網路調整更可進一步提高辨識準確度。這些功能都仰賴晶片上訓練,但目前支援晶片上訓練的設計相當有限。考量訓練所需的運算複雜度比推論高上許多,設計具高能量效率且能同時支援推論及訓練的深度學習處理器相當具有挑戰性。本設計提出文獻上第一顆可同時支援深度神經網路推論與訓練的客製化卷積式神經網路處理器,可支援各種神經網路維度與多種精準度需求。針對推論及訓練的卷積運算流程進行資料重新排序,並將卷積層及完全連接層的運算轉換成相同運算以大幅提升處理器性能。最大池化層與線性整流器的共同設計可以降低近75%的記憶體需求。簡化後的歸一化函數可省下78%的硬體資源。浮點數與固定點數整合分別為乘法器與加法器省下56.8%與17.3%的硬體資源,合併兩者之乘加器更能進一步節省33%的硬體資源。透過資料閘控及時脈閘控,可以在低精準度模式省下62%的功率消耗。處理器以40nm製程實現,推論階段能達到1.25 TOPS/W的能量效率,與文獻上之推論加速器效能相當。訓練模式之能量效率可達327 GOPS/W,較CPU高出105倍。 | zh_TW |
| dc.description.abstract | Deep learning has been widely deployed in many areas and demonstrates beyond-human performance in some applications. To meet the required computing power, many dedicated accelerators for deep neural network inference have been proposed. Active learning is beneficial to security and privacy protection, especially for health care and ID verification. In addition, self-adaptation can further improve the classification performance by leveraging users' specific features. These capabilities are enabled by on-chip training, but only limited on-chip training is supported by existing solutions. Since the computational complexity of training is much higher than that of inference, designing an energy-efficient processor for both inference and training is very challenging. This work presents a deep learning processor that supports both inference and training for convolutional neural networks of arbitrary dimensions and variable precisions. Data rearrangement and operation reformulation are utilized to significantly improve the performance. Max-pooling and ReLU modules are co-designed to reduce the memory requirement by 75%. The softmax function is modified to reduce the hardware area by 78%. The integration of fixed-point and floating-point operators reduces the area of multipliers and adders by 56.8% and 17.3%, respectively, and merging a multiplier and an adder into a unified MAC unit further reduces the area by 33%. In the low-precision mode, clock gating and data gating are employed to reduce the power consumption by 62%. Fabricated in 40-nm technology, the proposed deep learning processor achieves an energy efficiency of 1.25 TOPS/W in inference, which is competitive with state-of-the-art inference designs. The chip also delivers an energy efficiency of 327 GOPS/W in training, 105× higher than that of a high-end CPU. | en |
| dc.description.provenance | Made available in DSpace on 2021-07-11T15:24:07Z (GMT). No. of bitstreams: 1 ntu-108-R05943005-1.pdf: 3814795 bytes, checksum: 48f5e1576cdf84c0e5d542bfce961434 (MD5) Previous issue date: 2019 | en |
| dc.description.tableofcontents | 口試委員會審定書 ii 誌謝 iii 摘要 iv ABSTRACT v Contents vii List of Figures ix List of Tables x 1 INTRODUCTION 1 2 Convolutional Neural Network Algorithm 4 2.1 Inference Algorithm 4 2.2 Training Algorithm 6 2.2.1 Derivatives of convolutional layers 7 2.2.2 Derivatives of fully-connected layers 9 2.2.3 Derivatives of activation functions 9 3 System Architecture 10 3.1 Processing Unit (PU) 12 3.2 Processing Element Cluster (PEC) 15 3.3 Maxpooling and ReLU Module 18 3.4 Modified Softmax Module 20 4 Power-Area Optimization 21 4.1 Filter Rearrangement 21 4.2 Fixed-Floating-Point Hardware Integration 23 4.3 Multiplier with Variable Wordlengths 25 4.4 Modified Softmax 27 5 Chip Implementation 29 6 CONCLUSION 34 References 36 | - |
| dc.language.iso | en | - |
| dc.subject | CMOS 數位積體電路 | zh_TW |
| dc.subject | 晶片上訓練 | zh_TW |
| dc.subject | 神經網路調整 | zh_TW |
| dc.subject | 主動學習 | zh_TW |
| dc.subject | 卷積式神經網路 | zh_TW |
| dc.subject | 深度學習 | zh_TW |
| dc.subject | CMOS digital integrated circuits | en |
| dc.subject | neural network adaptation | en |
| dc.subject | active learning | en |
| dc.subject | convolutional neural network | en |
| dc.subject | Deep learning | en |
| dc.subject | on-chip training | en |
| dc.title | 具適應智能之可程式化深度學習處理器 | zh_TW |
| dc.title | A Fully-Programmable Deep Learning Processor with Adaptable Intelligence | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 107-1 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.oralexamcommittee | 劉宗德;柳德政 | zh_TW |
| dc.contributor.oralexamcommittee | Tsung-Te Liu; | en |
| dc.subject.keyword | 深度學習, 卷積式神經網路, 主動學習, 神經網路調整, 晶片上訓練, CMOS 數位積體電路 | zh_TW |
| dc.subject.keyword | Deep learning, convolutional neural network, active learning, neural network adaptation, on-chip training, CMOS digital integrated circuits | en |
| dc.relation.page | 39 | - |
| dc.identifier.doi | 10.6342/NTU201900136 | - |
| dc.rights.note | 未授權 | - |
| dc.date.accepted | 2019-01-22 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 電子工程學研究所 | - |
| dc.date.embargo-lift | 2024-01-24 | - |
Appears in Collections: 電子工程學研究所
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-107-1.pdf (Restricted Access) | 3.73 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.