Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/87567

Full metadata record
| DC 欄位 | 值 | 語言 |
|---|---|---|
| dc.contributor.advisor | 吳安宇 | zh_TW |
| dc.contributor.advisor | An-Yeu Wu | en |
| dc.contributor.author | 王則勛 | zh_TW |
| dc.contributor.author | Tse-Hsun Wang | en |
| dc.date.accessioned | 2023-06-20T16:06:21Z | - |
| dc.date.available | 2023-11-09 | - |
| dc.date.copyright | 2023-06-20 | - |
| dc.date.issued | 2022 | - |
| dc.date.submitted | 2022-10-28 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/87567 | - |
| dc.description.abstract | 隨著深度學習的蓬勃發展,深度學習已成為解決各種問題的重要方法。然而,深度學習的運算量非常龐大,以往只能在伺服器上進行。近年來,大資料時代讓資料量呈指數成長,若只在伺服器上進行深度學習運算,便必須面對資料傳輸時間過長與資料隱私的問題。為了解決上述問題,多數研究都將深度學習推論移至邊緣端,利用深度學習加速器提高運算效能。邊緣端的深度學習加速器仍需克服許多困難,最重要的是其高能耗:能量消耗一部分來自運算,另一部分來自資料傳輸,而後者往往被忽略。在深度學習加速器上,每進行一層運算時,常需先將激活值從動態隨機存取記憶體 (DRAM) 中取出,運算後再將結果存回DRAM,因而造成高能量消耗。針對這個問題,本文利用資料壓縮的方式壓縮輸出激活值,以減少能量消耗。本文利用激活值具高稀疏性的特性,使用零資料壓縮 (Zero-value Compression, ZVC) 技術,並搭配塊狀壓縮 (Block Compression, BC) 與繞過機制 (Bypass Mechanism),使壓縮率達到2.39倍。此外,我們也提出K有損壓縮 (K-lossy Compression),在只降低0.4%準確率的情況下,使壓縮率達到3.73倍。最後,我們結合上述演算法優化技術,提出一具可調整架構 (Scalable Architecture) 的資料壓縮/解壓縮引擎;相較於目前最先進的設計,吞吐量提高19%,面積僅增加8%,並以DRAMSim2驗證此引擎可降低56%的DRAM資料傳輸能耗。 | zh_TW |
| dc.description.abstract | With the rapid development of deep learning (DL), it has become an important approach to a wide range of problems. However, DL requires a large amount of computation, which traditionally had to be performed in the cloud. In recent years, the amount of data has grown exponentially, so cloud-based DL systems face the challenges of long data transmission times and data-privacy leakage. To address these issues, much research has moved DL inference to edge devices, where deep learning accelerators (DLAs) are developed to improve the computational efficiency of inference. However, a DLA consumes a great deal of energy, in two ways: computation and data transmission. We focus on reducing the energy consumed by data transmission, as it is the bottleneck of current DLAs. When a layer is computed, the DLA fetches the input activations from DRAM and, after computation, stores the output activations back to DRAM; this traffic between the DLA and DRAM dominates energy consumption. In this thesis, we use activation compression (AC) techniques to reduce the traffic between the DLA and DRAM and thus the overall energy consumption. We exploit the high sparsity of activations produced by the ReLU function: zero-value compression (ZVC) is combined with block compression and a bypass mechanism, achieving a ×2.39 compression ratio. We also propose two K-lossy compression techniques, mixed-K lossy compression and K-lossy-aware training, which achieve a ×3.73 compression ratio with only a 0.4% accuracy drop. Finally, combining the above algorithms, we propose a scalable architecture and implement it in hardware. The proposed architecture outperforms the state-of-the-art with 19% higher throughput at an 8% area overhead. The overall system energy is verified with DRAMSim2, showing that our method reduces DRAM read and write energy by 56%. (An illustrative sketch of these compression steps is given after the metadata table below.) | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-06-20T16:06:21Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2023-06-20T16:06:21Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Acknowledgements vii
Abstract (Chinese) ix
ABSTRACT xi
CONTENTS xiii
LIST OF FIGURES xvi
LIST OF TABLES xx
Chapter 1 Introduction 1
1.1 Background 1
1.1.1 The Background of Deep Learning 1
1.1.2 Deep Learning from Cloud to Edge 3
1.1.3 Deep Learning Accelerator (DLA) 5
1.2 Motivation and Main Contributions 6
1.2.1 The Architecture of DLA 6
1.2.2 The Bottleneck of DLA 7
1.2.3 Thesis Target 8
1.3 Thesis Organization 9
Chapter 2 Review of Activation Compression 10
2.1 Related Works of Activation Compression 10
2.1.1 Entropy Coding 10
2.1.2 Zero-value Compression (ZVC) 11
2.1.3 Zero Run-length Coding (Z-RLC) 13
2.1.4 Lane Compression 16
2.2 Challenges of the Prior Works 18
2.3 Summary 19
Chapter 3 Algorithm of Compression 20
3.1 Proposed Bit-level ZVC 20
3.1.1 Block Compression 20
3.1.2 Bypass Mechanism 23
3.2 K-Lossy Compression 26
3.2.1 Analysis of K-Lossy 26
3.2.2 Mixed K-lossy Compression 30
3.3 K-lossy-aware Training 35
3.3.1 Add K-lossy Noise 35
3.3.2 Simulation Result 36
3.4 Budget-aware Compression 39
3.4.1 Problem Formulation 39
3.4.2 Budget-aware Compression 42
3.4.3 Simulation Result 44
3.5 Summary 45
Chapter 4 Hardware IP and System Analysis 46
4.1 Hardware of Compressor 46
4.1.1 The Architecture of Compressor 46
4.1.2 Basic Function 48
4.1.3 ZVC and Block Compression 51
4.1.4 Proposed Scalable Bit-level Non-Zero Values Concatenation 53
4.1.5 Output Wrapper 55
4.2 Hardware of Decompressor 57
4.2.1 The Architecture of Decompressor 57
4.2.2 Input Unpacker 59
4.2.3 The Decompression Method 60
4.2.4 Proposed Scalable Bit-level Decompressor 61
4.3 Performance Results 63
4.3.1 Performance Analysis of Different Sizes of Sub-block 63
4.3.2 Performance Analysis of Related Work 64
4.4 System Analysis 65
4.4.1 Power Analysis 65
4.4.2 DRAMSim2 66
4.4.3 Simulation Results 68
4.5 Summary 70
Chapter 5 Main Contribution and Future Directions 71
5.1 Main Contribution 71
5.2 Future Directions 72
REFERENCE 73 | - |
| dc.language.iso | en | - |
| dc.subject | 可調整架構 | zh_TW |
| dc.subject | 有損壓縮 | zh_TW |
| dc.subject | 塊狀壓縮 | zh_TW |
| dc.subject | 激活值壓縮 | zh_TW |
| dc.subject | 零資料壓縮 | zh_TW |
| dc.subject | 繞過機制 | zh_TW |
| dc.subject | Zero-value compression | en |
| dc.subject | Bypass Mechanism | en |
| dc.subject | K-lossy | en |
| dc.subject | Block Compression | en |
| dc.subject | Activation compression | en |
| dc.subject | scalable architecture | en |
| dc.title | 基於稀疏性之低記憶體使用量激活值壓縮引擎設計 | zh_TW |
| dc.title | Sparsity-based Activation Compression Engine Design for Low-memory Access in DLA | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 111-1 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.oralexamcommittee | 盧奕璋;沈中安 | zh_TW |
| dc.contributor.oralexamcommittee | Yi-Chang Lu;Chung-An Shen | en |
| dc.subject.keyword | 激活值壓縮, 零資料壓縮, 塊狀壓縮, 繞過機制, 有損壓縮, 可調整架構 | zh_TW |
| dc.subject.keyword | Activation compression, Zero-value compression, Block Compression, Bypass Mechanism, K-lossy, scalable architecture | en |
| dc.relation.page | 76 | - |
| dc.identifier.doi | 10.6342/NTU202210009 | - |
| dc.rights.note | 未授權 | - |
| dc.date.accepted | 2022-10-31 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 電子工程學研究所 | - |
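
The abstract above only names the compression techniques, so the following is a minimal, word-level Python sketch of how zero-value compression, block compression with a bypass flag, and a K-lossy step could fit together. All names and parameters here (`compress_block`, `k_lossy`, block size 16, `k=8`, 8-bit activations) are illustrative assumptions; the thesis implements these ideas at the bit level in hardware, and its reported ×2.39 and ×3.73 ratios come from that design, not from this toy.

```python
import numpy as np

def zvc_compress_block(block: np.ndarray):
    """Zero-value compression (ZVC): a 1-bit-per-element zero mask
    plus the packed nonzero values."""
    mask = block != 0
    return mask, block[mask]

def compress_block(block: np.ndarray, bits_per_value: int = 8):
    """Block compression with a bypass: each block is compressed
    independently; if ZVC would not shrink this (dense) block, it is
    stored raw and one flag bit marks the bypass."""
    raw_bits = block.size * bits_per_value
    mask, nonzeros = zvc_compress_block(block)
    zvc_bits = mask.size + nonzeros.size * bits_per_value
    if zvc_bits < raw_bits:
        return ("zvc", mask, nonzeros), zvc_bits + 1  # +1 bypass flag
    return ("raw", block), raw_bits + 1

def k_lossy(block: np.ndarray, k: int) -> np.ndarray:
    """K-lossy step: keep only the k largest-magnitude activations in
    the block and zero the rest, so ZVC sees more zeros."""
    out = np.zeros_like(block)
    keep = np.argsort(np.abs(block))[-k:]  # indices of the top-k values
    out[keep] = block[keep]
    return out

# Toy usage: a sparse ReLU-like activation tile, split into blocks of 16.
rng = np.random.default_rng(0)
acts = np.maximum(rng.integers(-128, 127, size=64), 0).astype(np.uint8)
raw_bits = cmp_bits = 0
for i in range(0, acts.size, 16):
    block = k_lossy(acts[i:i + 16], k=8)  # optional lossy step
    _, bits = compress_block(block)
    raw_bits += block.size * 8
    cmp_bits += bits
print(f"toy compression ratio: {raw_bits / cmp_bits:.2f}x")
```

One note on the bypass design choice: under plain ZVC a dense block would grow by the size of its zero mask, so storing such blocks raw caps the worst-case expansion at one flag bit per block.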
| Appears in Collections: | 電子工程學研究所 |
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-111-1.pdf (restricted access) | 4.11 MB | Adobe PDF |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.