NTU Theses and Dissertations Repository > College of Electrical Engineering and Computer Science > Graduate Institute of Electronics Engineering
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96195
Full metadata record
DC Field: Value (Language)
dc.contributor.advisor: 闕志達 (zh_TW)
dc.contributor.advisor: Tzi-Dar Chiueh (en)
dc.contributor.author: 陳俊瑋 (zh_TW)
dc.contributor.author: Jun-Wei Chen (en)
dc.date.accessioned: 2024-11-28T16:07:59Z
dc.date.available: 2024-11-29
dc.date.copyright: 2024-11-28
dc.date.issued: 2024
dc.date.submitted: 2024-09-22
dc.identifier.citation:
[1] O. Russakovsky et al., “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
[2] Y. Lecun, L. Bottou, Y. Bengio and P. Haffner, “Gradient-based learning applied to document recognition,” in Proc. of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
[3] K. He, X. Zhang, S. Ren and J. Sun, “Deep Residual Learning for Image Recognition,” in Proc. of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778.
[4] A. Vaswani et al., “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, 2017.
[5] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[6] Y. Chen, T. Krishna, J. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017.
[7] Y.-H. Chen, T.-J. Yang, J. S. Emer, and V. Sze, “Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices,” IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 9, no. 2, pp. 292–308, Jun. 2019.
[8] Y. Wang et al., “A 28 nm 27.5 TOPS/W approximate-computing-based transformer processor with asymptotic sparsity speculating and out-of-order computing,” in Int. Solid-State Circuits Conf. Digest Tech. Papers, 2022, pp. 1–3.
[9] B. Keller et al., “A 17–95.6 TOPS/W deep learning inference accelerator with per-vector scaled 4-bit quantization for transformers in 5 nm,” in Proc. of 2022 IEEE Symposium on VLSI Technology and Circuits, 2022.
[10] T. Tambe et al., “22.9 A 12nm 18.1TFLOPs/W sparse transformer processor with entropy-based early exit, mixed-precision predication and fine-grained power management,” in Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2023, pp. 342–344.
[11] Y. Wang et al., “A 28nm 77.35 TOPS/W similar vectors traceable transformer processor with principal-component-prior speculating and dynamic bit-wise stationary computing,” in Proc. of 2023 IEEE Symposium on VLSI Technology and Circuits, 2023.
[12] A. Marchisio, D. Dura, M. Capra, M. Martina, G. Masera, and M. Shafique, “SwiftTron: An efficient hardware accelerator for quantized transformers,” in Proc. of Int. Joint Conf. Neural Netw., Gold Coast, Australia, 2023, pp. 1–9.
[13] L. Xuanqing, Y. Hsiang-Fu, D. Inderjit, and H. Cho-Jui, “Learning to encode position for transformer with continuous dynamical model,” in Proc. of Int. Conf. Mach. Learn., 2020, pp. 6327–6335.
[14] D. Hendrycks and K. Gimpel, “Gaussian error linear units (GELUs),” arXiv preprint arXiv:1606.08415, 2016.
[15] S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” in Proc. of the 32nd International Conference on International Conference on Machine Learning, vol. 37, 2015, pp.448-456.
[16] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
[17] A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[18] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Proc. of Eur. Conf. Comput. Vis., 2020, pp. 213–229.
[19] “ImageNet,” Image-net.org, 2017. http://image-net.org.
[20] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” in Proc. of Int. Conf. Mach. Learn., 2021, pp. 10347– 10357.
[21] M. Ding, B. Xiao, N. Codella, P. Luo, J. Wang, and L. Yuan, “Davit: Dual attention vision transformers,” in Proc. of Eur. Conf. Comput. Vis., 2022, pp. 74–92.
[22] Z. Liu et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proc. of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
[23] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
[24] Z. Xu et al., “ReCU: Reviving the dead weights in binary neural networks,” in Proc. of IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 5198–5208.
[25] Y. Bhalgat et al., “LSQ+: Improving low-bit quantization through learnable offsets and better initialization,” in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020, pp. 696–697.
[26] S. K. Esser et al., “Learned step size quantization,” arXiv preprint arXiv:1902.08153, Feb. 2019.
[27] B. Jacob et al., “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” in Proc. of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2704–2713.
[28] Y. Bengio, N. Léonard, and A. Courville, "Estimating or propagating gradients through stochastic neurons for conditional computation," arXiv preprint arXiv:1308.3432, 2013.
[29] M. Sun et al., "Vaqf: Fully automatic software-hardware co-design framework for low-bit vision transformer," arXiv preprint arXiv:2201.06618, 2022.
[30] Y.-H. Huang, P.-H. Kuo, and J.-D. Huang, “Hardware-friendly activation function designs and its efficient VLSI implementations for transformer-based applications,” in Proc. of 2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS), 2023.
[31] A. Dave et al., “HW/SW codesign for approximation-aware binary neural networks,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 13, no. 1, pp. 33–47, 2023.
[32] F. Tu et al., “A 28 nm 15.59 µJ/token full-digital bitline-transpose CIM-based sparse transformer accelerator with pipeline/parallel reconfigurable modes,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2022, pp. 466–468.
[33] H. Shao et al., “An efficient training accelerator for transformers with hardware-algorithm co-optimization,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2023.
[34] H. Mun et al., “A 28 nm 66.8 TOPS/W sparsity-aware dynamic-precision deep-learning processor,” in Proc. of 2023 IEEE Symposium on VLSI Technology and Circuits, 2023.
[35] S. Lee, J. Park, and D. Jeon, “A 4.27 TFLOPS/W FP4/FP8 hybrid-precision neural network training processor using shift-add MAC and reconfigurable PE array,” in Proc. of ESSCIRC 2023 - IEEE 49th European Solid State Circuits Conference, 2023.
[36] Y. Qin et al., “A 28nm 49.7 TOPS/W sparse transformer processor with random-projection-based speculation, multi-stationary dataflow, and redundant partial product elimination,” in Proc. of 2023 IEEE Asian Solid-State Circuits Conference (A-SSCC), 2023.
[37] S. Kim et al., “20.5 C-Transformer: A 2.6-18.1 μJ/token homogeneous DNN-transformer/spiking-transformer processor with big-little network and implicit weight generation for large language models,” in 2024 IEEE International Solid-State Circuits Conference (ISSCC) Dig. Tech. Papers, vol. 67, 2024.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96195
dc.description.abstract (zh_TW):
With the development of artificial intelligence and its applications, neural network computation has grown increasingly complex. In particular, the rapid rise of generative AI in recent years has made Transformer neural networks widely used, and they now power a broad range of applications, including image classification, natural language processing (NLP), and image generation. Although Transformers outperform traditional neural networks, their computational load and complexity far exceed expectations, and they require much larger computing resources and memory for training and inference. Recent research has therefore focused largely on low power and high energy efficiency, as well as on deployment to edge devices.

Motivated by the pursuit of high efficiency for generative AI, this thesis proposes, specifically for Transformers, binary-weight computation circuits, simplified nonlinear mathematical functions, and approximate computation circuits. All of these techniques deliver clear benefits at the hardware level, achieving high energy efficiency, and our experiments confirm that the resulting accuracy loss always stays within an acceptable range.

For the binary-weight circuits, we use quantization to reduce every weight to 1 bit, i.e., only the two values +1 and -1, while activations are quantized to 4 bits with a value range of -7 to +7 in consideration of accuracy loss. This makes the multiplier structure very simple: at the chip level, the 1W4A architecture reduces circuit area and power by 72% and 77%, respectively, compared with a conventional 8W8A architecture. Analyzed at the system level, the off-chip memory needed to store weights shrinks to 1/8 of the original, while computation latency and off-chip memory accesses fall by 82% and 78%, respectively.

For the nonlinear operations in the Transformer, we propose a simplification of Softmax, called (Hard)max, which removes the exponential and division operations outright, saving 21% of chip area and 37% of power. It also allows one matrix multiplication to be skipped entirely (becoming Select_(Row of V)), noticeably reducing the Transformer's overall computation and runtime: latency and off-chip memory accesses drop by a further 16% and 28%, respectively. We also propose FGELU to replace the GELU function, completely removing exponential and division operations from the Transformer; the accuracy loss caused by both replacements can be brought below 1% through training.

Finally, this thesis proposes a dual-mode adder based on approximate circuits. For computations that are less sensitive to accuracy, such as (Hard)max(QK^T), which only needs to find the maximum value, the other circuitry can be gated off and the computation supported by approximate adders alone; for accuracy-sensitive computations, every circuit is enabled to preserve accuracy. In other words, the circuit can use fewer arithmetic units in specific situations to avoid unnecessary energy consumption, saving up to 14% of power.
dc.description.abstract (en):
With the advancement of artificial intelligence and its applications, the computations involved in neural networks are becoming increasingly complex. In particular, the rapid emergence of generative AI in recent years has led to the widespread adoption of Transformers. These networks are now applied in various domains, including image classification, natural language processing, and image generation. Despite the superior performance of Transformers compared to traditional neural networks, their computational requirements and complexity have surpassed initial expectations, and they demand larger computational resources and memory to support training and inference. Consequently, recent research efforts have focused on achieving low power consumption and high energy efficiency, as well as exploring applications on edge devices.

Our work, driven by the pursuit of high efficiency in generative AI, specifically addresses Transformers by proposing binary-weight computation circuits, simplifications of nonlinear mathematical functions, and approximate computation circuits. These approaches deliver clear hardware-level benefits, achieving the goal of high energy efficiency, and experimental results consistently show that the accuracy loss incurred by these techniques remains within an acceptable range.

In the binary-weight circuits, we apply quantization to reduce all weights to 1 bit (+1 and -1), while activations are quantized to 4 bits to limit accuracy loss. This makes the multiplier structure significantly simpler: compared to a traditional 8W8A architecture, the area and power consumption of our 1W4A circuit decrease by 72% and 77%, respectively. At the system level, the off-chip memory required to store weights drops to 1/8 of the original, while latency and off-chip memory accesses for chip operations decrease by 82% and 78%, respectively.
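A minimal sketch of the 1W4A scheme described above, in plain Python; the function names are illustrative (not from the thesis), and the scale factors that usually accompany binarization are omitted for brevity:

```python
def binarize_weights(w):
    """Quantize real-valued weights to 1 bit: only +1 and -1 survive."""
    return [1 if x >= 0 else -1 for x in w]

def quantize_activation(a, n_bits=4):
    """Symmetric quantization of an activation to the 4-bit range [-7, +7]."""
    qmax = 2 ** (n_bits - 1) - 1  # 7 when n_bits = 4
    return max(-qmax, min(qmax, round(a)))

def binary_dot(weights_1bit, acts_4bit):
    """With 1-bit weights the dot product needs no multiplier at all:
    each term is either +a or -a, i.e. a single add or subtract."""
    return sum(a if w > 0 else -a for w, a in zip(weights_1bit, acts_4bit))
```

This is why the multiplier structure collapses: the 1W4A inner product becomes a chain of conditional adds, which is where the reported area and power savings come from.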

For the nonlinear operations in Transformers, we propose a simplified replacement for Softmax, termed (Hard)max, which eliminates the exponential and division operations outright, saving 21% in circuit area and 37% in power consumption. It also allows one matrix multiplication to be omitted (replaced with Select_(Row of V)), reducing the overall computation and runtime of the Transformer: latency drops by a further 16% and off-chip memory access by 28%. Furthermore, we propose FGELU to replace GELU, completely removing exponential and division operations from the Transformer. The accuracy loss caused by these replacements can be reduced to within 1% through training.
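The (Hard)max idea, replacing the softmax-weighted sum over V with a single row select, can be sketched as follows (a simplified illustration; the trained variant in the thesis may differ in detail):

```python
def hardmax_attention_row(scores, V):
    """For one query: instead of softmax(scores) @ V, return the row of V at
    the position of the maximum score. No exponential, no division, and the
    matrix multiply collapses to Select_(Row of V)."""
    j = max(range(len(scores)), key=lambda k: scores[k])
    return V[j]
```

Because only the position of the maximum matters, the exponentiation, normalization, and the full scores-times-V product all disappear from the datapath.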

Finally, we propose a dual-mode adder based on approximate circuits. For computations that are less sensitive to accuracy, such as (Hard)max(QK^T), which only needs to locate the maximum value, the remaining circuitry can be gated off and the computation supported by approximate adders alone; for accuracy-sensitive computations, every circuit is activated to maintain accuracy. The circuit can thus use fewer arithmetic units in specific situations, avoiding wasted energy and saving up to 14% of power consumption.
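The dual-mode adder pairs an exact mode with an approximate one. The thesis builds its adder from OAI21 cells; as a stand-in illustration under that caveat, the sketch below uses a lower-part-OR adder (LOA), a standard approximate-adder construction, to show how an approximate mode gives up low-bit carries in exchange for energy:

```python
def dual_mode_add(a, b, approximate=False, k=2):
    """Exact mode: ordinary addition. Approximate mode: a lower-part-OR adder,
    which ORs the k low bits (no carry chain) and adds the upper bits exactly.
    Illustrative only; the thesis's OAI21-based circuit differs in structure."""
    if not approximate:
        return a + b
    mask = (1 << k) - 1
    low = (a & mask) | (b & mask)      # approximate low part: no carries
    high = ((a >> k) + (b >> k)) << k  # exact high part; low-part carry dropped
    return high | low
```

In approximate mode the result can be slightly off (e.g. 3 + 1 with k = 2 yields 3 instead of 4), which is harmless when the adder only has to rank candidates for a maximum, as in (Hard)max(QK^T).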
dc.description.provenance (en): Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-11-28T16:07:59Z. No. of bitstreams: 0
dc.description.provenance (en): Made available in DSpace on 2024-11-28T16:07:59Z (GMT). No. of bitstreams: 0
dc.description.tableofcontents:
Acknowledgements i
Abstract (Chinese) iii
Abstract v
Table of Contents ix
List of Figures xii
List of Tables xv
Chapter 1 Introduction 1
1.1 Background 1
1.2 Motivation and Objectives 2
1.3 Thesis Organization and Contributions 3
Chapter 2 Introduction to the Transformer Neural Network 5
2.1 Multi-Head Attention 6
2.2 Multilayer Perceptron (MLP) 11
2.3 Normalization 13
2.4 Related Applications 14
2.5 Chapter 2 Summary 15
Chapter 3 Training and Inference of Binarized and Approximate-Computing Neural Networks 17
3.1 The ImageNet Dataset 17
3.2 Overview of Vision Transformers 18
3.2.1 ViT and DeiT 18
3.2.2 DaViT 21
3.2.3 Swin 22
3.3 Binary-Weight Neural Networks: Overview and Experimental Results 23
3.3.1 Neural Network Quantization 23
3.3.1.1 Weight Binarization 26
3.3.1.2 Learned Step Size Quantization (LSQ) 27
3.3.2 Training Binary-Weight Neural Networks 29
3.3.3 Training Results 31
3.4 Simplification of Nonlinear Mathematical Operations 32
3.4.1 Softmax to (Hard)max 33
3.4.1.1 Features and Advantages 34
3.4.1.2 Training Method and Experimental Results 37
3.4.2 GELU to FGELU 40
3.5 Approximate Computing 44
3.5.1 Adder Circuit Design Using OAI21 Cells 44
3.5.1.1 Dual-Mode Adder 45
3.5.1.2 Training Method and Experimental Results 47
3.5.2 Thresholding to Increase Sparsity 48
3.6 Chapter 3 Summary 50
Chapter 4 Hardware Architecture Design 53
4.1 Overall Planning 53
4.1.1 Computation Circuit Planning 53
4.1.2 Memory Configuration 54
4.2 IPM Cube Design 55
4.2.1 IPM (Inner Product Module) Design 55
4.2.2 IPM Plane Design 57
4.2.3 Integration 58
4.3 PPU (Post-Processing Unit) Design 60
4.3.1 Accumulator 60
4.3.2 (Hard)max Unit 61
4.3.3 FGELU Unit 63
4.3.4 Requantizer 64
4.3.5 Integration 65
4.4 Computation Flow 68
4.5 Overall Architecture 70
4.6 Chapter 4 Summary 74
Chapter 5 Chip Implementation 75
5.1 Design Flow 75
5.2 Verification Flow 76
5.3 Layout 77
5.4 Summary of Simulation Results 79
5.5 Comparison of Simulation Results 82
5.5.1 Quantization Precision 82
5.5.2 Softmax Implementations 85
5.5.3 Approximate Circuits 87
5.5.4 ImageNet-1000 Accuracy 89
5.5.5 Others 90
5.5.6 Overall Performance 91
5.6 Measurement Considerations and Results 93
5.7 Chapter 5 Summary 96
Chapter 6 Conclusion and Future Work 99
References 101
dc.language.iso: zh_TW
dc.subject: binary weights (zh_TW)
dc.subject: high energy efficiency (zh_TW)
dc.subject: dual-mode (zh_TW)
dc.subject: approximate circuit (zh_TW)
dc.subject: (Hard)max (zh_TW)
dc.subject: approximate circuit (en)
dc.subject: (Hard)max (en)
dc.subject: high energy efficiency (en)
dc.subject: binary weight (en)
dc.subject: Dual-Mode (en)
dc.title: Design and Chip Implementation of an Energy-Efficient Binary-Weight Transformer Neural Network Accelerator (zh_TW)
dc.title: Design and Implementation of an Energy-Efficient Binary-Weight Transformer Accelerator Chip (en)
dc.type: Thesis
dc.date.schoolyear: 113-1
dc.description.degree: Master
dc.contributor.oralexamcommittee: 楊家驤;劉宗德;蔡佩芸 (zh_TW)
dc.contributor.oralexamcommittee: Chia-Hsiang Yang;Tsung-Te Liu;Pei-Yun Tsai (en)
dc.subject.keyword: high energy efficiency, binary weights, (Hard)max, approximate circuit, dual-mode (zh_TW)
dc.subject.keyword: high energy efficiency, binary weight, (Hard)max, approximate circuit, Dual-Mode (en)
dc.relation.page: 107
dc.identifier.doi: 10.6342/NTU202404328
dc.rights.note: Authorized (access restricted to campus)
dc.date.accepted: 2024-09-23
dc.contributor.author-college: College of Electrical Engineering and Computer Science
dc.contributor.author-dept: Graduate Institute of Electronics Engineering
dc.date.embargo-lift: 2026-10-01
Appears in Collections: Graduate Institute of Electronics Engineering

Files in This Item:
File: ntu-113-1.pdf (restricted; not authorized for public access), 5.49 MB, Adobe PDF


Except where otherwise noted, all items in this repository are protected by copyright, with all rights reserved.
