Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/100980

| Title: | Ternary-Weight Transformer Acceleration Circuit Design and Chip Implementation (三元權重變形器神經網路加速電路之設計與晶片實現) |
| Author: | Fa-Yu Chen (陳法諭) |
| Advisor: | Tzi-Dar Chiueh (闕志達) |
| Keywords: | Ternary Weights, Multimax, QGELU, Energy Efficiency, Approximate Adder Circuits |
| Publication Year: | 2025 |
| Degree: | Master |
| Abstract: | In recent years, Transformer neural networks have achieved remarkable success in computer vision, natural language processing, and speech recognition, giving rise to numerous task-specific variants of the original model. Despite their superior performance, Transformers face high hardware cost and high inference energy consumption, which hinder deployment in resource-constrained environments such as edge devices and embedded systems. As model performance continues to improve, research increasingly focuses on balancing accuracy against hardware efficiency; this work addresses the key challenge of reducing parameter count and inference power consumption to enable deployment on edge platforms.

This work focuses on preserving Transformer accuracy while substantially reducing parameter count, hardware cost, and inference energy. We introduce optimization techniques targeting the key computations, including ternary weight quantization, nonlinear function approximations (Softmax and GELU), and approximate adders, and we design and implement a corresponding hardware accelerator chip. Quantization-aware training ensures that the model retains as much of its original accuracy as possible after these approximations are applied.

We first apply ternary weight quantization, mapping weights to {-1, 0, +1} with activations quantized to 4-bit precision (INT4). Experiments show that, compared with binary weights and INT4 activations, this setting reduces accuracy degradation while adding little hardware area or power.

For nonlinear function approximation, we propose Multimax, a hardware-friendly alternative to Softmax that keeps only the N largest values in each matrix row and assigns them proportional weights. Compared with a 32-bit single-precision floating-point Softmax circuit, the Multimax circuit reduces area by 95.8% and power by 97.2%; the chip supports N = 1, 3, or 5, configurable to match the software simulation settings. For the GELU activation function, we introduce QGELU, a piecewise (segmented) polynomial approximation whose coefficients are chosen so that the hardware needs only shift-and-add operations, further reducing complexity. Compared with a 32-bit single-precision floating-point GELU circuit, QGELU reduces area by 98.3% and power by 99.9%.

Finally, we propose a multi-mode approximate adder supporting exact addition, LSB approximation, and two-LSB approximation. Transformer computations that are less sensitive to accuracy can use the more aggressive approximation modes to save inference power, while accuracy-sensitive stages retain exact addition to preserve model accuracy. The approximate stage uses a hybrid two-layer OAI21/AAI21 structure and achieves 86.2% area and 90.8% power savings over a conventional adder; at the level of a single PE cube, the LSB and two-LSB modes save 4.3% and 21.4% of power, respectively.

The proposed accelerator chip, implemented in a 40 nm process and operating at 200 MHz, achieves an energy efficiency of 32.3 TOPS/W and an area efficiency of 1.50 TOPS/mm², outperforming neural network accelerators reported in previous works. When performing inference with ViT-Base, it reaches 1003.1 tokens/J. |
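The ternary weight quantization described in the abstract can be illustrated with a short numerical sketch. The abstract only states that weights are mapped to {-1, 0, +1} with INT4 activations; the threshold rule and scale factor below (a delta of 0.7 times the mean absolute weight, in the style of Ternary Weight Networks) are illustrative assumptions, not the thesis's actual quantizer or its quantization-aware training procedure.

```python
import numpy as np

def ternarize(w, delta_ratio=0.7):
    """Map a weight tensor to {-1, 0, +1} plus a per-tensor scale.

    NOTE: the 0.7 * mean(|w|) threshold is an illustrative assumption;
    the thesis's exact quantizer is not specified in the abstract.
    """
    delta = delta_ratio * np.mean(np.abs(w))              # ternarization threshold
    t = np.where(w > delta, 1, np.where(w < -delta, -1, 0))
    nz = t != 0
    alpha = np.abs(w[nz]).mean() if nz.any() else 0.0     # dequantization scale
    return t.astype(np.int8), alpha

def quantize_act_int4(x):
    """Symmetric INT4 quantization of activations (integer range [-8, 7])."""
    scale = np.abs(x).max() / 7.0                          # per-tensor scale (assumes x is not all zeros)
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

# With ternary weights, every multiply in the matrix product reduces to
# an add, a subtract, or a skip, which is what keeps the PE hardware small.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8))
x = rng.standard_normal(8)
tw, alpha = ternarize(w)
qx, s = quantize_act_int4(x)
y_approx = alpha * s * (tw.astype(np.int32) @ qx.astype(np.int32))
print(y_approx, w @ x)   # approximate vs. exact result
```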
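A behavioral sketch of the Multimax idea follows: per the abstract, only the N largest entries of each row are kept and given proportional weights, with everything else set to zero. The abstract does not state how the proportions are produced in hardware, so renormalizing the exponentials of the top-N entries below is one plausible interpretation rather than the thesis's circuit.

```python
import numpy as np

def multimax(scores, n=3):
    """Row-wise Multimax-style approximation of Softmax.

    Keeps only the N largest entries per row (the chip supports N = 1, 3, 5)
    and assigns them proportional weights; all other entries become 0.
    The proportion rule used here is an assumption for illustration.
    """
    out = np.zeros_like(scores, dtype=np.float64)
    for r, row in enumerate(scores):
        top = np.argpartition(row, -n)[-n:]       # indices of the N largest values
        e = np.exp(row[top] - row[top].max())     # numerically stable exponentials
        out[r, top] = e / e.sum()                 # proportions sum to 1 per row
    return out

attn = np.random.randn(2, 6)
print(multimax(attn, n=3))   # at most 3 nonzero attention weights per row
```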
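Below is a sketch of a piecewise polynomial GELU approximation in the spirit of QGELU. The real QGELU's segment boundaries and coefficients are not given in the abstract; the single quadratic segment and the power-of-two coefficients 1/4 and 1/2 used here are illustrative assumptions, chosen only because in fixed point they reduce to right-shifts plus an add, matching the shift-and-add constraint described above.

```python
import numpy as np

def qgelu_sketch(x):
    """Piecewise polynomial GELU approximation (QGELU-style sketch).

    NOTE: segment boundaries (+/-2) and coefficients (1/4, 1/2) are
    illustrative assumptions; in fixed point they become shifts by 2 and 1.
    """
    x = np.asarray(x, dtype=np.float64)
    mid = 0.25 * x * x + 0.5 * x                  # quadratic segment on (-2, 2)
    return np.where(x <= -2.0, 0.0, np.where(x >= 2.0, x, mid))

def gelu_ref(x):
    """Reference GELU (tanh formulation), used only for comparison."""
    x = np.asarray(x, dtype=np.float64)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

xs = np.linspace(-4.0, 4.0, 17)
print(np.max(np.abs(qgelu_sketch(xs) - gelu_ref(xs))))   # worst-case error of this toy segmentation
```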
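Finally, a behavioral model of the multi-mode approximate adder's accuracy trade-off. The thesis implements the approximate stage as a hybrid two-layer OAI21/AAI21 gate structure, which is not reproduced here; the lower-part-OR scheme below is a generic stand-in used only to show how the exact, LSB-approximate, and two-LSB-approximate modes differ in result while the upper bits stay exact.

```python
def approx_add(a, b, mode=0, width=12):
    """Behavioral model of a multi-mode approximate adder.

    mode = 0 : exact addition
    mode = 1 : lowest bit approximated
    mode = 2 : lowest two bits approximated

    NOTE: the OAI21/AAI21 gate-level design is not modeled; the
    approximate low bits here simply use a | b with the carry into
    the exact upper part suppressed (a common lower-part-OR scheme).
    """
    mask = (1 << width) - 1
    a, b = a & mask, b & mask
    if mode == 0:
        return (a + b) & mask
    low_mask = (1 << mode) - 1
    low = (a | b) & low_mask                       # carry-free approximate low bits
    high = ((a >> mode) + (b >> mode)) << mode     # exact addition of the upper bits
    return (high | low) & mask

# Error grows as more LSBs are approximated, which the accelerator tolerates
# in accuracy-insensitive Transformer computations to save power.
for m in range(3):
    print(m, approx_add(13, 7, mode=m))   # exact sum is 20
```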
| URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/100980 |
| DOI: | 10.6342/NTU202504546 |
| Full-Text Authorization: | Authorized (open access worldwide) |
| Electronic Full-Text Release Date: | 2025-11-27 |
| Appears in Collections: | Graduate Institute of Electronics Engineering |
Files in this item:
| File | Size | Format | |
|---|---|---|---|
| ntu-114-1.pdf | 10.22 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
