NTU Theses and Dissertations Repository > College of Electrical Engineering and Computer Science > Graduate Institute of Electronics Engineering
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96195
Full metadata record
DC Field: Value (Language)
dc.contributor.advisor: 闕志達 (zh_TW)
dc.contributor.advisor: Tzi-Dar Chiueh (en)
dc.contributor.author: 陳俊瑋 (zh_TW)
dc.contributor.author: Jun-Wei Chen (en)
dc.date.accessioned: 2024-11-28T16:07:59Z
dc.date.available: 2024-11-29
dc.date.copyright: 2024-11-28
dc.date.issued: 2024
dc.date.submitted: 2024-09-22
dc.identifier.citation:
[1] O. Russakovsky et al., “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
[2] Y. Lecun, L. Bottou, Y. Bengio and P. Haffner, “Gradient-based learning applied to document recognition,” in Proc. of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
[3] K. He, X. Zhang, S. Ren and J. Sun, “Deep Residual Learning for Image Recognition,” in Proc. of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778.
[4] A. Vaswani et al., “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, 2017.
[5] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[6] Y. Chen, T. Krishna, J. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017.
[7] Y.-H. Chen, T.-J. Yang, J. S. Emer, and V. Sze, “Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices,” IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 9, no. 2, pp. 292–308, Jun. 2019.
[8] Y. Wang et al., “A 28 nm 27.5 TOPS/W approximate-computing-based transformer processor with asymptotic sparsity speculating and out-of-order computing,” in Int. Solid-State Circuits Conf. Digest Tech. Papers, 2022, pp. 1–3.
[9] B. Keller et al., “A 17–95.6 TOPS/W deep learning inference accelerator with per-vector scaled 4-bit quantization for transformers in 5 nm,” in Proc. of 2022 IEEE Symposium on VLSI Technology and Circuits, 2022.
[10] T. Tambe et al., “22.9 A 12nm 18.1TFLOPs/W sparse transformer processor with entropy-based early exit, mixed-precision predication and fine-grained power management,” in Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2023, pp. 342–344.
[11] Y. Wang et al., “A 28nm 77.35 TOPS/W similar vectors traceable transformer processor with principal-component-prior speculating and dynamic bit-wise stationary computing,” in Proc. of 2023 IEEE Symposium on VLSI Technology and Circuits, 2023.
[12] A. Marchisio, D. Dura, M. Capra, M. Martina, G. Masera, and M. Shafique, “SwiftTron: An efficient hardware accelerator for quantized transformers,” in Proc. of Int. Joint Conf. Neural Netw., Gold Coast, Australia, 2023, pp. 1–9.
[13] L. Xuanqing, Y. Hsiang-Fu, D. Inderjit, and H. Cho-Jui, “Learning to encode position for transformer with continuous dynamical model,” in Proc. of Int. Conf. Mach. Learn., 2020, pp. 6327–6335.
[14] D. Hendrycks and K. Gimpel, “Gaussian error linear units (GELUs),” arXiv preprint arXiv:1606.08415, 2016.
[15] S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” in Proc. of the 32nd International Conference on International Conference on Machine Learning, vol. 37, 2015, pp.448-456.
[16] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
[17] A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[18] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Proc. of Eur. Conf. Comput. Vis., 2020, pp. 213–229.
[19] “ImageNet,” Image-net.org, 2017. http://image-net.org.
[20] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” in Proc. of Int. Conf. Mach. Learn., 2021, pp. 10347– 10357.
[21] M. Ding, B. Xiao, N. Codella, P. Luo, J. Wang, and L. Yuan, “Davit: Dual attention vision transformers,” in Proc. of Eur. Conf. Comput. Vis., 2022, pp. 74–92.
[22] Z. Liu et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proc. of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
[23] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
[24] Z. Xu et al., “ReCU: Reviving the dead weights in binary neural networks,” in Proc. of IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 5198–5208.
[25] Y. Bhalgat et al., “LSQ+: Improving low-bit quantization through learnable offsets and better initialization,” in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020, pp. 696–697.
[26] S. K. Esser et al., “Learned step size quantization,” arXiv preprint arXiv:1902.08153, Feb. 2019.
[27] B. Jacob et al., “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” in Proc. of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2704–2713.
[28] Y. Bengio, N. Léonard, and A. Courville, "Estimating or propagating gradients through stochastic neurons for conditional computation," arXiv preprint arXiv:1308.3432, 2013.
[29] M. Sun et al., "Vaqf: Fully automatic software-hardware co-design framework for low-bit vision transformer," arXiv preprint arXiv:2201.06618, 2022.
[30] Y.-H. Huang, P.-H. Kuo, and J.-D. Huang, “Hardware-friendly activation function designs and its efficient VLSI implementations for transformer-based applications,” in Proc. of 2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS), 2023.
[31] A. Dave et al., “HW/SW codesign for approximation-aware binary neural networks,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 13, no. 1, pp. 33–47, 2023.
[32] F. Tu et al., “A 28 nm 15.59 µJ/token full-digital bitline-transpose CIM-based sparse transformer accelerator with pipeline/parallel reconfigurable modes,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2022, pp. 466–468.
[33] H. Shao et al., “An efficient training accelerator for transformers with hardware-algorithm co-optimization,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2023.
[34] H. Mun et al., “A 28 nm 66.8 TOPS/W sparsity-aware dynamic-precision deep-learning processor,” in Proc. of 2023 IEEE Symposium on VLSI Technology and Circuits, 2023.
[35] S. Lee, J. Park, and D. Jeon, “A 4.27 TFLOPS/W FP4/FP8 hybrid-precision neural network training processor using shift-add MAC and reconfigurable PE array,” in Proc. of ESSCIRC 2023 - IEEE 49th European Solid State Circuits Conference, 2023.
[36] Y. Qin et al., “A 28nm 49.7 TOPS/W sparse transformer processor with random-projection-based speculation, multi-stationary dataflow, and redundant partial product elimination,” in Proc. of 2023 IEEE Asian Solid-State Circuits Conference (A-SSCC), 2023.
[37] S. Kim et al., “20.5 C-Transformer: A 2.6-18.1 μJ/token homogeneous DNN-transformer/spiking-transformer processor with big-little network and implicit weight generation for large language models,” in 2024 IEEE International Solid-State Circuits Conference (ISSCC) Dig. Tech. Papers, vol. 67, 2024.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96195
dc.description.abstract (zh_TW):
With the development of artificial intelligence and its applications, neural network computation has grown increasingly complex. In particular, the rapid rise of generative AI in recent years has made Transformer neural networks widely used, and they now power a broad range of applications, including image classification, natural language processing (NLP), and image generation. Although Transformers outperform traditional neural networks, their computational load and complexity far exceed expectations, and they require much larger computing resources and memory for training and inference. Recent research has therefore focused largely on low power and high energy efficiency, as well as on deployment to edge devices.

Motivated by the pursuit of high efficiency for generative AI, this thesis proposes, specifically for Transformers, binary-weight computation circuits, simplified nonlinear mathematical functions, and approximate computation circuits. All of these techniques deliver clear benefits at the hardware level, achieving high energy efficiency, and our experiments confirm that the resulting accuracy loss always stays within an acceptable range.

For the binary-weight circuits, we use quantization to reduce every weight to 1 bit, i.e., only the two values +1 and -1, while activations are quantized to 4 bits with a value range of -7 to +7 in consideration of accuracy loss. This makes the multiplier structure very simple: at the chip level, the 1W4A architecture reduces circuit area and power by 72% and 77%, respectively, compared with a conventional 8W8A architecture. Analyzed at the system level, the off-chip memory needed to store weights shrinks to 1/8 of the original, while computation latency and off-chip memory accesses fall by 82% and 78%, respectively.

For the nonlinear operations in the Transformer, we propose a simplification of Softmax, called (Hard)max, which removes the exponential and division operations outright, saving 21% of chip area and 37% of power. It also allows one matrix multiplication to be skipped entirely (becoming Select_(Row of V)), noticeably reducing the Transformer's overall computation and runtime: latency and off-chip memory accesses drop by a further 16% and 28%, respectively. We also propose FGELU to replace the GELU function, completely removing exponential and division operations from the Transformer; the accuracy loss caused by both replacements can be brought below 1% through training.

Finally, this thesis proposes a dual-mode adder based on approximate circuits. For computations that are less sensitive to accuracy, such as (Hard)max(QK^T), which only needs to find the maximum value, the other circuitry can be gated off and the computation supported by approximate adders alone; for accuracy-sensitive computations, every circuit is enabled to preserve accuracy. In other words, the circuit can use fewer arithmetic units in specific situations to avoid unnecessary energy consumption, saving up to 14% of power.
dc.description.abstract (en):
With the advancement of artificial intelligence and its applications, the computations involved in neural networks are becoming increasingly complex. In particular, the rapid emergence of generative AI in recent years has led to the widespread adoption of Transformers. These networks are now applied in various domains, including image classification, natural language processing, and image generation. Despite the superior performance of Transformers compared to traditional neural networks, their computational requirements and complexity have surpassed initial expectations, and they demand larger computational resources and memory to support training and inference. Consequently, recent research efforts have focused on achieving low power consumption and high energy efficiency, as well as exploring applications on edge devices.

Our work, driven by the pursuit of high efficiency in generative AI, specifically addresses Transformers by proposing binary-weight computation circuits, simplifications of nonlinear mathematical functions, and approximate computation circuits. These approaches deliver clear hardware-level benefits, achieving the goal of high energy efficiency, and experimental results consistently show that the accuracy loss incurred by these techniques remains within an acceptable range.

In the binary-weight circuits, we apply quantization to reduce all weights to 1 bit (+1 and -1), while activations are quantized to 4 bits to limit accuracy loss. This makes the multiplier structure significantly simpler: compared to a traditional 8W8A architecture, the area and power consumption of our 1W4A circuit decrease by 72% and 77%, respectively. At the system level, the off-chip memory required to store weights drops to 1/8 of the original, while latency and off-chip memory accesses for chip operations decrease by 82% and 78%, respectively.
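A minimal sketch of the 1W4A scheme described above, in plain Python; the function names are illustrative (not from the thesis), and the scale factors that usually accompany binarization are omitted for brevity:

```python
def binarize_weights(w):
    """Quantize real-valued weights to 1 bit: only +1 and -1 survive."""
    return [1 if x >= 0 else -1 for x in w]

def quantize_activation(a, n_bits=4):
    """Symmetric quantization of an activation to the 4-bit range [-7, +7]."""
    qmax = 2 ** (n_bits - 1) - 1  # 7 when n_bits = 4
    return max(-qmax, min(qmax, round(a)))

def binary_dot(weights_1bit, acts_4bit):
    """With 1-bit weights the dot product needs no multiplier at all:
    each term is either +a or -a, i.e. a single add or subtract."""
    return sum(a if w > 0 else -a for w, a in zip(weights_1bit, acts_4bit))
```

This is why the multiplier structure collapses: the 1W4A inner product becomes a chain of conditional adds, which is where the reported area and power savings come from.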

For the nonlinear operations in Transformers, we propose a simplified replacement for Softmax, termed (Hard)max, which eliminates the exponential and division operations outright, saving 21% in circuit area and 37% in power consumption. It also allows one matrix multiplication to be omitted (replaced with Select_(Row of V)), reducing the overall computation and runtime of the Transformer: latency drops by a further 16% and off-chip memory access by 28%. Furthermore, we propose FGELU to replace GELU, completely removing exponential and division operations from the Transformer. The accuracy loss caused by these replacements can be reduced to within 1% through training.
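The (Hard)max idea, replacing the softmax-weighted sum over V with a single row select, can be sketched as follows (a simplified illustration; the trained variant in the thesis may differ in detail):

```python
def hardmax_attention_row(scores, V):
    """For one query: instead of softmax(scores) @ V, return the row of V at
    the position of the maximum score. No exponential, no division, and the
    matrix multiply collapses to Select_(Row of V)."""
    j = max(range(len(scores)), key=lambda k: scores[k])
    return V[j]
```

Because only the position of the maximum matters, the exponentiation, normalization, and the full scores-times-V product all disappear from the datapath.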

Finally, we propose a dual-mode adder based on approximate circuits. For computations that are less sensitive to accuracy, such as (Hard)max(QK^T), which only needs to locate the maximum value, the remaining circuitry can be gated off and the computation supported by approximate adders alone; for accuracy-sensitive computations, every circuit is activated to maintain accuracy. The circuit can thus use fewer arithmetic units in specific situations, avoiding wasted energy and saving up to 14% of power consumption.
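The dual-mode adder pairs an exact mode with an approximate one. The thesis builds its adder from OAI21 cells; as a stand-in illustration under that caveat, the sketch below uses a lower-part-OR adder (LOA), a standard approximate-adder construction, to show how an approximate mode gives up low-bit carries in exchange for energy:

```python
def dual_mode_add(a, b, approximate=False, k=2):
    """Exact mode: ordinary addition. Approximate mode: a lower-part-OR adder,
    which ORs the k low bits (no carry chain) and adds the upper bits exactly.
    Illustrative only; the thesis's OAI21-based circuit differs in structure."""
    if not approximate:
        return a + b
    mask = (1 << k) - 1
    low = (a & mask) | (b & mask)      # approximate low part: no carries
    high = ((a >> k) + (b >> k)) << k  # exact high part; low-part carry dropped
    return high | low
```

In approximate mode the result can be slightly off (e.g. 3 + 1 with k = 2 yields 3 instead of 4), which is harmless when the adder only has to rank candidates for a maximum, as in (Hard)max(QK^T).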
dc.description.provenance (en): Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-11-28T16:07:59Z. No. of bitstreams: 0
dc.description.provenance (en): Made available in DSpace on 2024-11-28T16:07:59Z (GMT). No. of bitstreams: 0
dc.description.tableofcontents:
Acknowledgements i
Abstract (Chinese) iii
Abstract v
Table of Contents ix
List of Figures xii
List of Tables xv
Chapter 1 Introduction 1
1.1 Background 1
1.2 Motivation and Objectives 2
1.3 Thesis Organization and Contributions 3
Chapter 2 Introduction to the Transformer Neural Network 5
2.1 Multi-Head Attention 6
2.2 Multilayer Perceptron (MLP) 11
2.3 Normalization 13
2.4 Related Applications 14
2.5 Chapter 2 Summary 15
Chapter 3 Training and Inference of Binarized and Approximate-Computing Neural Networks 17
3.1 The ImageNet Dataset 17
3.2 Overview of Vision Transformers 18
3.2.1 ViT and DeiT 18
3.2.2 DaViT 21
3.2.3 Swin 22
3.3 Binary-Weight Neural Networks: Overview and Experimental Results 23
3.3.1 Neural Network Quantization 23
3.3.1.1 Weight Binarization 26
3.3.1.2 Learned Step Size Quantization (LSQ) 27
3.3.2 Training Binary-Weight Neural Networks 29
3.3.3 Training Results 31
3.4 Simplification of Nonlinear Mathematical Operations 32
3.4.1 Softmax to (Hard)max 33
3.4.1.1 Features and Advantages 34
3.4.1.2 Training Method and Experimental Results 37
3.4.2 GELU to FGELU 40
3.5 Approximate Computing 44
3.5.1 Adder Circuit Design Using OAI21 Cells 44
3.5.1.1 Dual-Mode Adder 45
3.5.1.2 Training Method and Experimental Results 47
3.5.2 Thresholding to Increase Sparsity 48
3.6 Chapter 3 Summary 50
Chapter 4 Hardware Architecture Design 53
4.1 Overall Planning 53
4.1.1 Computation Circuit Planning 53
4.1.2 Memory Configuration 54
4.2 IPM Cube Design 55
4.2.1 IPM (Inner Product Module) Design 55
4.2.2 IPM Plane Design 57
4.2.3 Integration 58
4.3 PPU (Post-Processing Unit) Design 60
4.3.1 Accumulator 60
4.3.2 (Hard)max Unit 61
4.3.3 FGELU Unit 63
4.3.4 Requantizer 64
4.3.5 Integration 65
4.4 Computation Flow 68
4.5 Overall Architecture 70
4.6 Chapter 4 Summary 74
Chapter 5 Chip Implementation 75
5.1 Design Flow 75
5.2 Verification Flow 76
5.3 Layout 77
5.4 Summary of Simulation Results 79
5.5 Comparison of Simulation Results 82
5.5.1 Quantization Precision 82
5.5.2 Softmax Implementations 85
5.5.3 Approximate Circuits 87
5.5.4 ImageNet-1000 Accuracy 89
5.5.5 Others 90
5.5.6 Overall Performance 91
5.6 Measurement Considerations and Results 93
5.7 Chapter 5 Summary 96
Chapter 6 Conclusion and Future Work 99
References 101
dc.language.iso: zh_TW
dc.subject: binary weights (zh_TW)
dc.subject: high energy efficiency (zh_TW)
dc.subject: dual-mode (zh_TW)
dc.subject: approximate circuit (zh_TW)
dc.subject: (Hard)max (zh_TW)
dc.subject: approximate circuit (en)
dc.subject: (Hard)max (en)
dc.subject: high energy efficiency (en)
dc.subject: binary weight (en)
dc.subject: Dual-Mode (en)
dc.title: Design and Chip Implementation of an Energy-Efficient Binary-Weight Transformer Neural Network Accelerator (zh_TW)
dc.title: Design and Implementation of an Energy-Efficient Binary-Weight Transformer Accelerator Chip (en)
dc.type: Thesis
dc.date.schoolyear: 113-1
dc.description.degree: Master
dc.contributor.oralexamcommittee: 楊家驤;劉宗德;蔡佩芸 (zh_TW)
dc.contributor.oralexamcommittee: Chia-Hsiang Yang;Tsung-Te Liu;Pei-Yun Tsai (en)
dc.subject.keyword: high energy efficiency, binary weights, (Hard)max, approximate circuit, dual-mode (zh_TW)
dc.subject.keyword: high energy efficiency, binary weight, (Hard)max, approximate circuit, Dual-Mode (en)
dc.relation.page: 107
dc.identifier.doi: 10.6342/NTU202404328
dc.rights.note: Authorized (access restricted to campus)
dc.date.accepted: 2024-09-23
dc.contributor.author-college: College of Electrical Engineering and Computer Science
dc.contributor.author-dept: Graduate Institute of Electronics Engineering
dc.date.embargo-lift: 2026-10-01
Appears in Collections: Graduate Institute of Electronics Engineering

Files in This Item:
File: ntu-113-1.pdf (restricted; not authorized for public access), 5.49 MB, Adobe PDF


Except where otherwise noted, all items in this repository are protected by copyright, with all rights reserved.
