Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/87914

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 陳良基 | zh_TW |
| dc.contributor.advisor | Liang-Gee Chen | en |
| dc.contributor.author | 林宥陞 | zh_TW |
| dc.contributor.author | You-Sheng Lin | en |
| dc.date.accessioned | 2023-07-31T16:17:00Z | - |
| dc.date.available | 2023-11-09 | - |
| dc.date.copyright | 2023-07-31 | - |
| dc.date.issued | 2023 | - |
| dc.date.submitted | 2023-06-29 | - |
| dc.identifier.citation | KWON, Hyoukjun, et al. Heterogeneous dataflow accelerators for multi-DNN workloads. In: 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2021. p. 71-83. CHEN, Yu-Hsin, et al. Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2019, 9.2: 292-308. QIN, Eric, et al. Sigma: A sparse and irregular gemm accelerator with flexible interconnects for dnn training. In: 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2020. p. 58-70. KWON, Hyoukjun; SAMAJDAR, Ananda; KRISHNA, Tushar. Maeri: Enabling flexible dataflow mapping over dnn accelerators via reconfigurable interconnects. ACM SIGPLAN Notices, 2018, 53.2: 461-475. HE, Kaiming, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. p. 770-778. SANDLER, Mark, et al. Mobilenetv2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. p. 4510-4520. VASWANI, Ashish, et al. Attention is all you need. Advances in Neural Information Processing Systems, 2017, 30. KRIZHEVSKY, Alex; SUTSKEVER, Ilya; HINTON, Geoffrey E. Imagenet classification with deep convolutional neural networks. Communications of the ACM, 2017, 60.6: 84-90. SIMONYAN, Karen; ZISSERMAN, Andrew. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. HOWARD, Andrew G., et al. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017. NVIDIA. NVDLA deep learning accelerator. http://nvdla.org, 2017. CHEN, Yu-Hsin, et al. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits, 2016, 52.1: 127-138. DU, Zidong, et al. ShiDianNao: Shifting vision processing closer to the sensor. In: Proceedings of the 42nd Annual International Symposium on Computer Architecture. 2015. p. 92-104. KONG, Taeyoung, et al. Hardware abstractions and hardware mechanisms to support multi-task execution on coarse-grained reconfigurable arrays. arXiv preprint arXiv:2301.00861, 2023. BOROUMAND, Amirali, et al. Google neural network models for edge devices: Analyzing and mitigating machine learning inference bottlenecks. In: 2021 30th International Conference on Parallel Architectures and Compilation Techniques (PACT). IEEE, 2021. p. 159-172. KWON, Hyoukjun, et al. Understanding reuse, performance, and hardware cost of dnn dataflow: A data-centric approach. In: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. 2019. p. 754-768. PARASHAR, Angshuman, et al. Timeloop: A systematic approach to dnn accelerator evaluation. In: 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 2019. p. 304-315. YANG, Xuan, et al. Interstellar: Using halide's scheduling language to analyze dnn accelerators. In: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 2020. p. 369-383. GUIRADO, Robert, et al. Understanding the impact of on-chip communication on DNN accelerator performance. In: 2019 26th IEEE International Conference on Electronics, Circuits and Systems (ICECS). IEEE, 2019. p. 85-88. ARORA, Sanjeev; LEIGHTON, Tom; MAGGS, Bruce. On-line algorithms for path selection in a nonblocking network. In: Proceedings of the Twenty-Second Annual ACM Symposium on Theory of Computing. 1990. p. 149-158. CHANG, Kuo-Wei; CHANG, Tian-Sheuan. VWA: Hardware efficient vectorwise accelerator for convolutional neural network. IEEE Transactions on Circuits and Systems I: Regular Papers, 2019, 67.1: 145-154. MURALIMANOHAR, Naveen; BALASUBRAMONIAN, Rajeev; JOUPPI, Norman P. CACTI 6.0: A tool to model large caches. HP Laboratories, 2009. p. 22-31. GAO, Mingyu, et al. Tangram: Optimized coarse-grained dataflow for scalable nn accelerators. In: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems. 2019. p. 807-820. LU, Wenyan, et al. Flexflow: A flexible dataflow accelerator architecture for convolutional neural networks. In: 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2017. p. 553-564. JANG, Jun-Woo, et al. Sparsity-aware and re-configurable NPU architecture for Samsung flagship mobile SoC. In: 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2021. p. 15-28. WEI, Shaojun, et al. Reconfigurability, why it matters in AI tasks processing: A survey of reconfigurable AI chips. IEEE Transactions on Circuits and Systems I: Regular Papers, 2022. BAEK, Eunjin; KWON, Dongup; KIM, Jangwoo. A multi-neural network acceleration architecture. In: 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2020. p. 940-953. SPANTIDI, Ourania, et al. Targeting dnn inference via efficient utilization of heterogeneous precision dnn accelerators. IEEE Transactions on Emerging Topics in Computing, 2022, 11.1: 112-125. QIN, Eric, et al. Enabling flexibility for sparse tensor acceleration via heterogeneity. arXiv preprint arXiv:2201.08916, 2022. VENIERIS, Stylianos I.; BOUGANIS, Christos-Savvas; LANE, Nicholas D. Multi-DNN accelerators for next-generation AI systems. arXiv preprint arXiv:2205.09376, 2022. SYMONS, Arne, et al. Towards heterogeneous multi-core accelerators exploiting fine-grained scheduling of layer-fused deep neural networks. arXiv preprint arXiv:2212.10612, 2022. ZENG, Shulin, et al. Serving multi-DNN workloads on FPGAs: A coordinated architecture, scheduling, and mapping perspective. IEEE Transactions on Computers, 2022. LI, Baoting, et al. Dynamic dataflow scheduling and computation mapping techniques for efficient depthwise separable convolution acceleration. IEEE Transactions on Circuits and Systems I: Regular Papers, 2021, 68.8: 3279-3292. WANG, Minjie; HUANG, Chien-chin; LI, Jinyang. Supporting very large models using automatic dataflow graph partitioning. In: Proceedings of the Fourteenth EuroSys Conference 2019. 2019. p. 1-17. SONG, Linghao, et al. Hypar: Towards hybrid parallelism for deep learning accelerator array. In: 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2019. p. 56-68. JOUPPI, Norman P., et al. In-datacenter performance analysis of a tensor processing unit. In: Proceedings of the 44th Annual International Symposium on Computer Architecture. 2017. p. 1-12. HAN, Meng, et al. ReDas: Supporting fine-grained reshaping and multiple dataflows on systolic array. arXiv preprint arXiv:2302.07520, 2023. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/87914 | - |
| dc.description.abstract | AI在各個領域廣泛應用。為了應對具有數十億參數和快速演進架構的模型的複雜性,神經網絡模型和計算能力需要高度整合。儘管通用加速器使用複雜的網絡來適應模型變化,但特定任務的加速器提供了更好的解決方案。通過分析,我們發現神經網絡模型的變化是漸進和可預測的。我們提出了一種新的架構,將神經網絡模型劃分為具有相似計算特性的子集。通過將這些子集映射到優化的子加速器上,我們實現了計算能力和神經網絡模型之間的高度整合。我們的架構在Resnet50中相較於最先進的加速器,平均減少了32%的PE使用量和24%的能源消耗。對於像UNet這樣的影像分割模型,相較於最先進的加速器,我們提供了49%的PE使用量減少和39%的能源消耗減少。 | zh_TW |
| dc.description.abstract | AI is widely used across many domains. To handle the complexity of models with billions of parameters and rapidly evolving architectures, NN models and computational power need to be highly integrated. While general-purpose accelerators use complex networks to adapt to model variations, task-specific accelerators offer a better solution. Through analysis, we found that NN model variations are gradual and predictable. We propose a new architecture that divides an NN model into subsets with similar computational characteristics. By mapping these subsets onto optimized sub-accelerators, we achieve a high level of integration between computational power and the NN model (an illustrative partitioning sketch follows the metadata table below). Our architecture reduces PE usage by an average of 32% and energy cost by 24% compared to state-of-the-art accelerators on ResNet50. For image-segmentation models such as UNet, it provides a 49% reduction in PE usage and a 39% reduction in energy cost compared to state-of-the-art accelerators. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-07-31T16:16:59Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2023-07-31T16:17:00Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Abstract i; List of Figures vii; List of Tables xi; 1 Introduction 1; 1.1 Motivation 1; 1.2 Goals 2; 1.3 Design Challenges for IP Architecture 3; 1.4 Design Consideration 7; 1.5 Contributions of This Thesis 9; 1.6 Thesis Organization 9; 2 Background 11; 2.1 Dataflow and Mapping Parallelism 11; 2.2 DNN Accelerator 15; 2.2.1 Overview 15; 2.2.2 Reconfigurable Dataflow Accelerator (RDA) 16; 2.2.3 Heterogeneous Dataflow Accelerator (HDA) 17; 2.3 Summary 18; 3 Proposed Architecture 19; 3.1 Adjustable Parallelism Dataflow Accelerator (APDA) 21; 3.1.1 Overview 21; 3.1.2 Design Consideration of APDA 22; 3.2 Adjustable Dataflow and Architecture 23; 3.2.1 Dataflow Selection 23; 3.2.2 Sub-Accelerator Architecture 25; 3.3 Composition of Sub-Accelerator 31; 3.3.1 Mapping Parallelism Selection 31; 3.3.2 Hardware Resource Allocation 33; 3.4 Design Space Exploration (DSE) 34; 3.4.1 Overview 35; 3.4.2 Optimization Target 36; 3.4.3 Area and Energy Estimation 37; 3.4.4 Algorithm 37; 3.4.5 Framework 39; 3.5 Summary 39; 4 Result 41; 4.1 Architecture Configurations 41; 4.2 Experimental Methodology 43; 4.2.1 Evaluation Method 44; 4.2.2 Benchmark 44; 4.3 Performance and Hardware Usage 45; 4.3.1 Sub-Accelerator Notation 45; 4.3.2 Impact of APDA with Sub-Accelerator 46; 4.3.3 Comparison with Previous Works 58; 4.4 PE Parallelism Scalability 61; 4.5 Energy Cost 63; 4.5.1 Impact of Data Broadcast 64; 4.5.2 Comparison with Previous Works 65; 5 Discussion 69; 5.1 DRAM Access 69; 5.2 DRAM Bandwidth 70; 5.3 Overhead of Data Broadcast 71; 6 Conclusion 75; Reference 77 | - |
| dc.language.iso | en | - |
| dc.subject | 神經網路加速器 | zh_TW |
| dc.subject | 網路分割 | zh_TW |
| dc.subject | 可調整資料流 | zh_TW |
| dc.subject | 特定任務加速器 | zh_TW |
| dc.subject | 模組化架構 | zh_TW |
| dc.subject | DNN accelerator | en |
| dc.subject | Task-specific accelerator | en |
| dc.subject | Network partition | en |
| dc.subject | IP-Based Architecture | en |
| dc.subject | Adjustable dataflow | en |
| dc.title | 適用於特定神經網路任務之可調節映射平行度的模組化加速器 | zh_TW |
| dc.title | IP-Based Accelerator with Adjustable Mapping Parallelism Dataflow for Task-Specific DNN | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 111-2 | - |
| dc.description.degree | 碩士 (Master's) | - |
| dc.contributor.oralexamcommittee | 賴永康;黃朝宗;楊佳玲 | zh_TW |
| dc.contributor.oralexamcommittee | Yeong-Kang Lai;Chao-Tsung Huang;Chia-Lin Yang | en |
| dc.subject.keyword | 模組化架構,特定任務加速器,神經網路加速器,可調整資料流,網路分割 | zh_TW |
| dc.subject.keyword | IP-Based Architecture, Task-specific accelerator, DNN accelerator, Adjustable dataflow, Network partition | en |
| dc.relation.page | 82 | - |
| dc.identifier.doi | 10.6342/NTU202301139 | - |
| dc.rights.note | 同意授權(限校園內公開) (authorization granted; access restricted to campus) | - |
| dc.date.accepted | 2023-06-29 | - |
| dc.contributor.author-college | 電機資訊學院 (College of Electrical Engineering and Computer Science) | - |
| dc.contributor.author-dept | 電子工程學研究所 (Graduate Institute of Electronics Engineering) | - |
| dc.date.embargo-lift | 2028-06-27 | - |
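
The abstract above describes partitioning a neural network into subsets of layers with similar computational characteristics and mapping each subset onto a sub-accelerator tuned for it. The snippet below is a minimal, hypothetical Python sketch of that idea only; the `Layer` descriptor, the `choose_dataflow` heuristic, and `partition_by_dataflow` are illustrative assumptions, not the APDA architecture or the design-space-exploration framework described in the thesis.

```python
# Illustrative sketch (not the thesis implementation): group a network's layers
# into subsets with similar computational characteristics and assign each subset
# a dataflow/sub-accelerator suited to it. All names here are hypothetical.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Layer:
    name: str
    in_channels: int
    out_channels: int
    spatial: int  # input feature-map height/width (assumed square)

def choose_dataflow(layer: Layer) -> str:
    # Early layers: large feature maps, few channels -> favor spatial parallelism.
    # Late layers: small feature maps, many channels -> favor channel parallelism.
    if layer.spatial * layer.spatial > layer.out_channels:
        return "spatial-parallel"
    return "channel-parallel"

def partition_by_dataflow(layers: List[Layer]) -> Dict[str, List[Layer]]:
    # Collect layers that share the same preferred dataflow into one subset,
    # so each subset can be mapped onto a sub-accelerator optimized for it.
    subsets: Dict[str, List[Layer]] = {}
    for layer in layers:
        subsets.setdefault(choose_dataflow(layer), []).append(layer)
    return subsets

if __name__ == "__main__":
    # A rough ResNet-like progression: feature maps shrink while channels grow.
    net = [
        Layer("conv1", 3, 64, 112),
        Layer("conv2_x", 64, 256, 56),
        Layer("conv4_x", 512, 1024, 14),
        Layer("conv5_x", 1024, 2048, 7),
    ]
    for dataflow, subset in partition_by_dataflow(net).items():
        print(dataflow, "->", [layer.name for layer in subset])
```

Running the sketch on the ResNet-like example groups the large-feature-map early layers separately from the channel-heavy late layers, which mirrors the gradual, predictable layer-to-layer variation the abstract refers to.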
Appears in Collections: 電子工程學研究所 (Graduate Institute of Electronics Engineering)
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-111-2.pdf (Restricted Access) | 2.92 MB | Adobe PDF |