NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/87915
Full metadata record
DC field: value (language)
dc.contributor.advisor: 陳良基 (zh_TW)
dc.contributor.advisor: Liang-Gee Chen (en)
dc.contributor.author: 柯宗賢 (zh_TW)
dc.contributor.author: Tsung-Hsien Ke (en)
dc.date.accessioned: 2023-07-31T16:17:23Z
dc.date.available: 2023-11-09
dc.date.copyright: 2023-07-31
dc.date.issued: 2023
dc.date.submitted: 2023-06-29
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/87915
dc.description.abstract (zh_TW): With the development of artificial intelligence, today's edge devices are also expected to accelerate convolutional neural networks. Different AI applications map onto different networks, so we aim to propose a general-purpose accelerator that computes efficiently across these networks. These networks, however, exhibit different degrees of computational parallelism, and edge devices impose hardware constraints, so we propose a methodology that considers computational parallelism and hardware constraints at the same time.
In this thesis we propose the kernel decomposition method, which converts convolutions of every type into 1x1 convolutions for computation, thereby widening the range of operations that can be accelerated effectively. Under this method we maintain high processing element (PE) utilization (>90%) when accelerating different operations. We further propose a scheduling method that, while preserving this high PE utilization, optimizes DRAM data access to reach the lowest possible energy consumption; with 1024 PEs it reduces DRAM access by 37.42% on AlexNet and by 52.44% on VGG16.
dc.description.abstract (en): There are several applications of CNNs embedded in edge devices. These applications target various convolutional neural networks, which have different computational parallelisms (CP). To design an accelerator for various networks on an edge device, we need to consider both the various CPs and the hardware resource constraints.
In this thesis, we propose the Kernel Decomposition (KD) method, a methodology for converting CONVs into 1x1 CONVs with a stride of 1, which gives the architecture more flexibility during data mapping and achieves high PE utilization (>90%). In addition, building on this high PE utilization, the data scheduling approach pursues minimal DRAM access, reducing it by 37.42% on AlexNet and 52.44% on VGG16 with 1024 PEs.
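As a concrete illustration of the conversion the abstract describes, below is a minimal NumPy sketch. It is illustrative only, not the thesis's implementation, dataflow, or data scheduling: it assumes the decomposition follows the standard identity that a KxK, stride-1 convolution equals the accumulation of KxK separate 1x1 convolutions (each a plain GEMM) over shifted views of the input feature map. All function names and tensor shapes here are hypothetical.

```python
# Illustrative-only sketch of the kernel decomposition idea (assumed identity,
# not the thesis's architecture): a KxK stride-1 convolution is rewritten as
# K*K accumulated 1x1 convolutions, i.e. GEMMs, over shifted input views.
import numpy as np


def conv2d_direct(x, w):
    """Reference KxK convolution (CNN-style cross-correlation), valid padding, stride 1.

    x: input feature map, shape (C, H, W); w: filters, shape (M, C, K, K).
    """
    C, H, W = x.shape
    M, _, K, _ = w.shape
    Ho, Wo = H - K + 1, W - K + 1
    y = np.zeros((M, Ho, Wo))
    for m in range(M):
        for i in range(Ho):
            for j in range(Wo):
                y[m, i, j] = np.sum(x[:, i:i + K, j:j + K] * w[m])
    return y


def conv2d_kernel_decomposed(x, w):
    """Same result expressed as K*K accumulated 1x1 convolutions (GEMMs)."""
    C, H, W = x.shape
    M, _, K, _ = w.shape
    Ho, Wo = H - K + 1, W - K + 1
    y = np.zeros((M, Ho, Wo))
    for ki in range(K):
        for kj in range(K):
            # Shifted input view seen by weight position (ki, kj): shape (C, Ho, Wo).
            x_shift = x[:, ki:ki + Ho, kj:kj + Wo].reshape(C, Ho * Wo)
            # A 1x1 convolution is an (M, C) x (C, Ho*Wo) matrix multiply.
            y += (w[:, :, ki, kj] @ x_shift).reshape(M, Ho, Wo)
    return y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((3, 8, 8))      # hypothetical 3-channel 8x8 input
    w = rng.standard_normal((4, 3, 3, 3))   # hypothetical 4 filters of size 3x3
    assert np.allclose(conv2d_direct(x, w), conv2d_kernel_decomposed(x, w))
    print("kernel-decomposed result matches direct convolution")
```

Each such 1x1 step is a plain GEMM, which presumably is what lets the method map different layer shapes onto the same systolic-array fabric while keeping PE utilization high, as the abstract claims.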
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-07-31T16:17:23Z. No. of bitstreams: 0 (en)
dc.description.provenance: Made available in DSpace on 2023-07-31T16:17:23Z (GMT). No. of bitstreams: 0 (en)
dc.description.tableofcontents:
Thesis Oral Defense Committee Certification ii
Abstract (Chinese) iv
Abstract vi
Contents viii
List of Figures xiv
I. Introduction 1
I.A. Motivation 1
I.B. Goals 1
I.B.i. High performance 1
I.B.ii. Energy efficiency 2
I.C. Design challenges for CNN accelerator 2
I.C.i. Various layer shapes of CNN networks 2
I.C.ii. Different computations of CNN 3
I.C.iii. Large size of input feature map and filters 3
I.D. Design targets for CNN accelerator 3
I.D.i. PE utilization 4
I.D.ii. DRAM access 4
I.D.iii. Flexibility 5
I.E. Major contributions 5
I.F. Thesis organization 5
II. Background 7
II.A. Convolutional neural network 7
II.A.i. Overview 7
II.A.ii. Various convolutional neural networks 8
II.B. Various convolutional layers in hardware design 9
II.B.i. Overview 9
II.B.ii. Size of data 9
II.B.iii. Different operations 10
II.C. CNN accelerators 11
II.C.i. Overview 11
II.C.ii. Throughput targeted accelerators 11
II.C.iii. Energy targeted accelerators 12
II.D. Flexible CNN accelerators 13
II.D.i. Overview 13
II.D.ii. Reconfigurable dataflow 13
II.D.iii. Reconfigurable systolic array 13
II.E. Computational Parallelism (CP) 15
II.E.i. Overview 15
II.E.ii. Relationship between CP and PE utilization 16
II.E.iii. Relationship between CP and DRAM access 17
II.F. CP and flexible architecture 20
II.F.i. Overview 20
II.F.ii. Reconfigurable dataflow 20
II.F.iii. Reconfigurable architecture 21
II.G. SRAM arrangement problem 21
III. Proposed Method – Kernel Decomposition Method (KD) 25
III.A. Overview 25
III.B. The systolic-array-based architecture 26
III.C. Operation Conversion 28
III.C.i. Overview 28
III.C.ii. Dataflow 28
III.C.iii. Correctness of dataflow 31
III.C.iv. Flexibility of operation conversion 32
III.C.v. PE utilization of operation conversion 35
III.C.vi. DRAM access of operation conversion 36
III.D. Data scheduling 37
III.D.i. Overview 37
III.D.ii. Loop nest of sub-matrices 37
III.D.iii. Comparison with conventional CONV 42
III.E. The analytical model for kernel decomposition method 43
III.E.i. Overview 43
III.E.ii. Analytical model – Operation Conversion 43
III.E.iii. Analytical model – Data Scheduling 44
IV. Result 47
IV.A. Implementation results 47
IV.A.i. Evaluation method 47
IV.A.ii. Area result 47
IV.B. Comparison of conventional CONV and kernel decomposition method 49
IV.B.i. Evaluation method 49
IV.B.ii. PE utilization 49
IV.B.iii. DRAM access 51
IV.B.iv. Flexibility 54
IV.C. Comparison with SOTA 55
IV.C.i. Evaluation method 55
IV.C.ii. PE utilization 56
IV.C.iii. DRAM access 57
IV.C.iv. Flexibility 59
V. Discussion 61
V.A. KD in depthwise convolution 61
V.A.i. Overview 61
V.A.ii. Discussion for depthwise CONV 61
V.A.iii. Discussion for pointwise CONV 61
VI. Conclusion 63
VII. Reference 65
dc.language.iso: en
dc.subject: DRAM access (zh_TW)
dc.subject: flexible computation (zh_TW)
dc.subject: kernel decomposition method (zh_TW)
dc.subject: convolutional neural network accelerator (zh_TW)
dc.subject: data scheduling (zh_TW)
dc.subject: CNN accelerator (en)
dc.subject: kernel decomposition (en)
dc.subject: flexibility (en)
dc.subject: DRAM access (en)
dc.subject: data scheduling (en)
dc.title: 內核分解法之靈活卷積神經網絡推理加速器 (zh_TW)
dc.title: Kernel Decomposition Method for Flexible Convolution Neural Network Inference Accelerator (en)
dc.type: Thesis
dc.date.schoolyear: 111-2
dc.description.degree: Master
dc.contributor.oralexamcommittee: 賴永康;黃朝宗;楊佳玲 (zh_TW)
dc.contributor.oralexamcommittee: Yeong-Kang Lai;Chao-Tsung Huang;CL Yang (en)
dc.subject.keyword: flexible computation, kernel decomposition method, convolutional neural network accelerator, data scheduling, DRAM access (zh_TW)
dc.subject.keyword: flexibility, kernel decomposition, CNN accelerator, data scheduling, DRAM access (en)
dc.relation.page: 69
dc.identifier.doi: 10.6342/NTU202301238
dc.rights.note: Access granted (restricted to campus-only access)
dc.date.accepted: 2023-06-30
dc.contributor.author-college: College of Electrical Engineering and Computer Science
dc.contributor.author-dept: Graduate Institute of Electronics Engineering
Appears in Collections: Graduate Institute of Electronics Engineering

Files in this item:
File: ntu-111-2.pdf (2.65 MB, Adobe PDF)
Access is restricted to NTU campus IP addresses (off-campus users can connect through the library's VPN service).