NTU Theses and Dissertations Repository › College of Electrical Engineering and Computer Science › Graduate Institute of Electronics Engineering
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96195
Title: 高效能二值權重變形器神經網路加速器之設計與晶片實現
Design and Implementation of an Energy-Efficient Binary-Weight Transformer Accelerator Chip
Author: 陳俊瑋
Jun-Wei Chen
Advisor: 闕志達
Tzi-Dar Chiueh
Keywords: high energy efficiency, binary weight, (Hard)max, approximate circuit, dual-mode
Publication Year: 2024
Degree: Master
Abstract:
With the advancement of artificial intelligence and its applications, the computation involved in neural networks has become increasingly complex. In particular, the rapid rise of generative AI in recent years has driven the widespread adoption of Transformer networks, which are now applied across many domains, including image classification, natural language processing (NLP), and image generation. Although Transformers outperform traditional neural networks, their computational load and complexity far exceed initial expectations, and they demand much larger computational resources and memory for training and inference. Consequently, recent research has focused on low power consumption and high energy efficiency, as well as on applications for edge devices.

Our work, driven by the pursuit of energy efficiency in generative AI, targets Transformers with three techniques: binary-weight computation circuits, simplified nonlinear mathematical functions, and approximate computation circuits. These approaches deliver clear benefits at the hardware level, achieving high energy efficiency, and our experiments consistently show that the accuracy loss incurred by these techniques remains within an acceptable range.

For the binary-weight circuits, we employ quantization to reduce all weights to 1-bit values (+1 and -1), while activations are quantized to 4-bit values in the range -7 to +7 to limit accuracy loss. This makes the multiplier structure very simple: compared with a conventional 8W8A architecture, our 1W4A circuit reduces area and power consumption by 72% and 77%, respectively. At the system level, the off-chip memory needed to store weights drops to 1/8 of the original, while the latency and off-chip memory accesses of chip operations decrease by 82% and 78%, respectively.
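The 1W4A scheme above can be modeled in a few lines of plain Python. This is a behavioral sketch only, not the thesis's hardware design: the function names and the per-tensor scale parameter are illustrative assumptions. The key point it demonstrates is that with weights restricted to +1/-1, every "multiplication" collapses to an add or subtract, which is why the multiplier hardware all but disappears.

```python
def quantize_weights_binary(w):
    """Binarize a weight matrix to +1/-1 by sign (illustrative)."""
    return [[1 if v >= 0 else -1 for v in row] for row in w]

def quantize_activations_4bit(x, scale):
    """Quantize activations to signed 4-bit values, clipped to [-7, +7].
    `scale` is a hypothetical per-tensor quantization step."""
    return [max(-7, min(7, round(v / scale))) for v in x]

def binary_weight_dot(a_q, w_col):
    """Multiplier-free dot product: with each weight in {+1, -1},
    every product term is just the activation or its negation."""
    return sum(a if w == 1 else -a for a, w in zip(a_q, w_col))

# Example: quantize, then accumulate with adds/subtracts only.
acts = quantize_activations_4bit([0.9, -1.3, 0.2, 3.5], 0.5)  # [2, -3, 0, 7]
col = quantize_weights_binary([[0.4], [-0.1], [0.7], [0.2]])
result = binary_weight_dot(acts, [row[0] for row in col])
```

The inner loop contains no multiply at all, which mirrors the area and power savings the abstract reports for the 1W4A datapath.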

For the nonlinear operations in Transformers, we propose a simplified replacement for Softmax, termed (Hard)max, which eliminates the exponential and division operations entirely, saving 21% in circuit area and 37% in power consumption. It also allows one matrix multiplication to be omitted (replaced with Select_(Row of V)), reducing the overall computation and runtime of the Transformer: latency and off-chip memory accesses drop by a further 16% and 28%, respectively. In addition, we propose FGELU to replace GELU, removing the remaining exponential and division operations from the Transformer. The accuracy loss caused by both replacements can be brought below 1% through training.
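The (Hard)max idea can be sketched as follows. The thesis's exact definition is not given in this abstract, so this is a plausible reading of the description: instead of computing softmax(QK^T) and multiplying the resulting probabilities by V, each output row simply selects the row of V at the position of the maximum score (the Select_(Row of V) operation), which removes both the exponential/division and one full matrix multiply.

```python
def hardmax_attention_row(scores, V):
    """(Hard)max sketch: rather than softmax(scores) @ V, return the row
    of V at the argmax of the scores. No exp, no division, and the
    weighted-sum matmul becomes a single row select."""
    idx = max(range(len(scores)), key=lambda i: scores[i])
    return V[idx]

# One query's attention scores against three keys; V has three rows.
out = hardmax_attention_row([1.0, 3.0, 2.0], [[1, 0], [0, 1], [5, 5]])
```

Because only a comparison tree is needed to find the maximum, this operation is also a natural candidate for the approximate-adder mode described next.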

Finally, we propose a dual-mode adder based on approximate circuits. For computations that are less sensitive to accuracy, such as finding the maximum value in (Hard)max(QK^T), the remaining circuitry can be gated off and the operation carried out using only approximate adders; for accuracy-sensitive computations, every circuit is activated to preserve accuracy. In other words, the circuit can perform certain operations with fewer active units, avoiding unnecessary energy consumption and saving up to 14% in power.
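The abstract does not specify the internal structure of the dual-mode adder, so the sketch below substitutes a common approximate-adder construction, the lower-part-OR adder (LOA), as a stand-in: in approximate mode the low k bits are combined with a bitwise OR (no carry chain, modeling the gated-off logic), while the upper bits still add exactly; in exact mode the full adder runs. The parameter k and the LOA choice are assumptions, not the thesis's design.

```python
def dual_mode_add(a, b, k, exact=True):
    """Dual-mode adder sketch (LOA-style approximation, hypothetical):
    exact mode uses a normal add; approximate mode ORs the low k bits
    (no carry propagation) and adds only the upper bits exactly."""
    if exact:
        return a + b
    mask = (1 << k) - 1
    low = (a & mask) | (b & mask)       # approximate low bits: OR, no carry
    high = ((a >> k) + (b >> k)) << k   # exact addition of upper bits
    return high | low

# Exact mode: 13 + 7 = 20. Approximate mode with k=3 low bits ORed:
approx = dual_mode_add(13, 7, 3, exact=False)
```

The error is confined to the low k bits, which is tolerable when the result only feeds a max-finding comparison, as in (Hard)max(QK^T).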
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96195
DOI: 10.6342/NTU202404328
Full-text License: Authorized (campus-only access)
Electronic Full-text Release Date: 2026-10-01
Appears in Collections: Graduate Institute of Electronics Engineering

Files in this item:
File: ntu-113-1.pdf (restricted access)
Size: 5.49 MB
Format: Adobe PDF


Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated in their license terms.

© NTU Library All Rights Reserved