Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/87915

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 陳良基 | zh_TW |
| dc.contributor.advisor | Liang-Gee Chen | en |
| dc.contributor.author | 柯宗賢 | zh_TW |
| dc.contributor.author | Tsung-Hsien Ke | en |
| dc.date.accessioned | 2023-07-31T16:17:23Z | - |
| dc.date.available | 2023-11-09 | - |
| dc.date.copyright | 2023-07-31 | - |
| dc.date.issued | 2023 | - |
| dc.date.submitted | 2023-06-29 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/87915 | - |
| dc.description.abstract | 隨著人工智慧的發展,現今的邊緣裝置也希望能夠加速卷積神經網絡,各種不同的人工智慧應用對應到不同的網絡來做運算,因此我們希望提出一個通用型的加速器能夠在不同的網絡中都能有效地進行運算。但是這些不同的網絡又會有不同的運算平行度,而且邊緣裝置也有硬體的限制,因此我們提出了可以同時考量運算平行度跟硬體限制的理論。 在此篇論文中我們提出了內核分解法,即是希望能透過將各種類型的卷積都轉換成1x1的卷積來做運算,並藉此來增加可有效加速的運算種類,而我們確實能在此方法下,對不同的運算做加速時,都能保持在高的運算單元使用率(>90%)。此外我們還提出了一種排程的方法,可以在高的運算單元使用率的前提下,再對DRAM的資料讀取進行優化,以盡可能地達到較低的能量消耗,並且可以在使用了1024個運算單元的情況下,在Alexnet有了37.42%的DRAM資料讀取量下降,以及在VGG16有了52.44%的DRAM資料讀取量下降。 | zh_TW |
| dc.description.abstract | Several CNN applications are embedded in edge devices. These applications target various convolutional neural networks, which have different computational parallelisms (CP). To design an accelerator for various networks on an edge device, we must consider both the various CPs and the hardware resource constraints. In this thesis, we propose the Kernel Decomposition (KD) method, a methodology for converting CONVs into 1x1 CONVs with a stride of 1, which gives the architecture more flexibility during data mapping and achieves good PE utilization (> 90%). In addition, building on this PE utilization, the proposed data scheduling approach pursues minimal DRAM access, reducing it by 37.42% on AlexNet and 52.44% on VGG16 with 1024 PEs (a minimal code sketch of the decomposition follows the metadata table below). | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-07-31T16:17:23Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2023-07-31T16:17:23Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Oral examination committee certification ii Chinese abstract iv Abstract vi Contents viii List of Figures xiv I. Introduction 1 I.A. Motivation 1 I.B. Goals 1 I.B.i. High performance 1 I.B.ii. Energy efficiency 2 I.C. Design challenges for CNN accelerator 2 I.C.i. Various layer shapes of CNN networks 2 I.C.ii. Different computations of CNN 3 I.C.iii. Large size of input feature map and filters 3 I.D. Design targets for CNN accelerator 3 I.D.i. PE utilization 4 I.D.ii. DRAM access 4 I.D.iii. Flexibility 5 I.E. Major contributions 5 I.F. Thesis organization 5 II. Background 7 II.A. Convolution neural network 7 II.A.i. Overview 7 II.A.ii. Various convolutional neural networks 8 II.B. Various convolutional layers in hardware design 9 II.B.i. Overview 9 II.B.ii. Size of data 9 II.B.iii. Different operations 10 II.C. CNN accelerators 11 II.C.i. Overview 11 II.C.ii. Throughput targeted accelerators 11 II.C.iii. Energy targeted accelerators 12 II.D. Flexible CNN accelerators 13 II.D.i. Overview 13 II.D.ii. Reconfigurable dataflow 13 II.D.iii. Reconfigurable systolic array 13 II.E. Computational Parallelism (CP) 15 II.E.i. Overview 15 II.E.ii. Relationship between CP and PE utilization 16 II.E.iii. Relationship between CP and DRAM access 17 II.F. CP and flexible architecture 20 II.F.i. Overview 20 II.F.ii. Reconfigurable dataflow 20 II.F.iii. Reconfigurable architecture 21 II.G. SRAM arrangement problem 21 III. Proposed Method – Kernel Decomposition Method (KD) 25 III.A. Overview 25 III.B. The systolic-array-based architecture 26 III.C. Operation Conversion 28 III.C.i. Overview 28 III.C.ii. Dataflow 28 III.C.iii. Correctness of dataflow 31 III.C.iv. Flexibility of operation conversion 32 III.C.v. PE utilization of operation conversion 35 III.C.vi. DRAM access of operation conversion 36 III.D. Data scheduling 37 III.D.i. Overview 37 III.D.ii. Loop nest of sub-matrices 37 III.D.iii. Comparison with conventional CONV 42 III.E. The analytical model for kernel decomposition method 43 III.E.i. Overview 43 III.E.ii. Analytical model – Operation Conversion 43 III.E.iii. Analytical model – Data Scheduling 44 IV. Result 47 IV.A. Implementation results 47 IV.A.i. Evaluation method 47 IV.A.ii. Area result 47 IV.B. Comparison of conventional CONV and kernel decomposition method 49 IV.B.i. Evaluation method 49 IV.B.ii. PE utilization 49 IV.B.iii. DRAM access 51 IV.B.iv. Flexibility 54 IV.C. Comparison with SOTA 55 IV.C.i. Evaluation method 55 IV.C.ii. PE utilization 56 IV.C.iii. DRAM access 57 IV.C.iv. Flexibility 59 V. Discussion 61 V.A. KD in depthwise convolution 61 V.A.i. Overview 61 V.A.ii. Discussion for depthwise CONV 61 V.A.iii. Discussion for pointwise CONV 61 VI. Conclusion 63 VII. Reference 65 | - |
| dc.language.iso | en | - |
| dc.subject | DRAM讀取 | zh_TW |
| dc.subject | 靈活運算 | zh_TW |
| dc.subject | 內核分解法 | zh_TW |
| dc.subject | 卷積神經網絡加速器 | zh_TW |
| dc.subject | 資料排程 | zh_TW |
| dc.subject | CNN accelerator | en |
| dc.subject | kernel decomposition | en |
| dc.subject | flexibility | en |
| dc.subject | DRAM access | en |
| dc.subject | data scheduling | en |
| dc.title | 內核分解法之靈活卷積神經網絡推理加速器 | zh_TW |
| dc.title | Kernel Decomposition Method for Flexible Convolution Neural Network Inference Accelerator | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 111-2 | - |
| dc.description.degree | Master's | - |
| dc.contributor.oralexamcommittee | 賴永康;黃朝宗;楊佳玲 | zh_TW |
| dc.contributor.oralexamcommittee | Yeong-Kang Lai;Chao-Tsung Huang;Chia-Lin Yang | en |
| dc.subject.keyword | 靈活運算,內核分解法,卷積神經網絡加速器,資料排程,DRAM讀取 | zh_TW |
| dc.subject.keyword | flexibility,kernel decomposition,CNN accelerator,data scheduling,DRAM access | en |
| dc.relation.page | 69 | - |
| dc.identifier.doi | 10.6342/NTU202301238 | - |
| dc.rights.note | Authorized for release (campus access only) | - |
| dc.date.accepted | 2023-06-30 | - |
| dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
| dc.contributor.author-dept | Graduate Institute of Electronics Engineering | - |
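The abstract above describes the core KD idea: converting CONVs into 1x1 CONVs with a stride of 1. As a rough, self-contained illustration of that conversion only, and not the thesis's systolic-array dataflow, data scheduling, or analytical model, the NumPy sketch below computes a KxK convolution as an accumulation of KxK 1x1 convolutions over shifted inputs; the function names are made up for this example.

```python
import numpy as np

def conv2d_direct(x, w):
    """Reference KxK convolution. x: (C, H, W), w: (F, C, K, K), stride 1, no padding."""
    C, H, W = x.shape
    F, _, K, _ = w.shape
    Ho, Wo = H - K + 1, W - K + 1
    out = np.zeros((F, Ho, Wo))
    for f in range(F):
        for i in range(Ho):
            for j in range(Wo):
                out[f, i, j] = np.sum(w[f] * x[:, i:i + K, j:j + K])
    return out

def conv2d_kernel_decomposed(x, w):
    """Same convolution computed as K*K accumulated 1x1 CONVs (stride 1).

    Each kernel tap (p, q) becomes a 1x1 CONV, i.e. an (F, C) x (C, Ho*Wo)
    matrix multiply applied to the input shifted by (p, q).
    """
    C, H, W = x.shape
    F, _, K, _ = w.shape
    Ho, Wo = H - K + 1, W - K + 1
    out = np.zeros((F, Ho * Wo))
    for p in range(K):
        for q in range(K):
            shifted = x[:, p:p + Ho, q:q + Wo].reshape(C, -1)  # (C, Ho*Wo)
            out += w[:, :, p, q] @ shifted  # accumulate this tap's partial sums
    return out.reshape(F, Ho, Wo)

# Sanity check on a random 3x3 layer: the two paths must agree.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))   # 8 input channels, 16x16 feature map
w = rng.standard_normal((4, 8, 3, 3))  # 4 filters with 3x3 kernels
assert np.allclose(conv2d_direct(x, w), conv2d_kernel_decomposed(x, w))
```

Each 1x1 CONV is a plain matrix multiply over the channel dimension, a uniform workload shape that systolic arrays map well; this sketch only checks functional equivalence, not the PE-utilization or DRAM-access results reported in the abstract.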
Appears in Collections: Graduate Institute of Electronics Engineering
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-111-2.pdf (access restricted to NTU campus IPs; use the NTU VPN service for off-campus access) | 2.65 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
