Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92338
Title: An Energy-Efficient Learning Processor for Transformer-Based Neural Networks
Author: Ping-Sheng Wu (吳秉陞)
Advisor: Chia-Hsiang Yang (楊家驤)
Keywords: Transformer, Self-Attention, Training, Gradient Approximation, Speculative Computation Skipping, Digital CMOS Integrated Circuits
Publication Year: 2024
Degree: Master's
Abstract: Transformer-based neural networks, being the foundation of recent large language models, have dominated a multitude of machine-learning domains due to their versatility and high performance. However, the self-attention-centric structure of Transformer-based networks results in substantial computational complexity, which hinders their deployment on edge devices, especially for applications that require training. This thesis proposes the first Transformer learning processor in the literature that supports both inference and training acceleration, achieving up to a 94.2% reduction in overall training complexity through algorithm-architecture co-optimization. First, a gradient approximation based on one-shot sampling reduces the multiply-accumulate operations (MACs) for computing attention score gradients by 99.6% without degrading training convergence. Second, a ternary-vector-based speculation flags data of little contextual significance, enabling 50-to-80% of the associated computations to be skipped while keeping the training performance degradation below 1.2%. Additionally, token-based block floating-point arithmetic reduces power per MAC by 39-to-60%. Fabricated in a 40 nm CMOS technology, the proposed Transformer learning processor occupies a core area of 5.45 mm² and consumes 10-to-119 mW at clock frequencies of 10-to-200 MHz from a 0.6-to-1.16 V supply. It delivers a peak energy efficiency of 99.2 TOPS/W at 46 MHz under 0.64 V. Compared with state-of-the-art inference-only Transformer processors, the proposed design achieves 2.6-to-162 times higher energy efficiency while supporting both inference and training.
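The abstract above names token-based block floating-point (BFP) arithmetic as the low-power MAC format. The following is a minimal NumPy sketch of the general BFP idea only, assuming one shared exponent per token vector and 8-bit signed mantissas; the function names (to_block_fp, bfp_dot) and bit widths are illustrative assumptions, not the processor's actual number format.

```python
import numpy as np

def to_block_fp(x, mantissa_bits=8):
    """Quantize a 1-D activation vector (one token) to block floating-point:
    all elements share a single exponent and keep a short signed mantissa.
    Illustrative sketch only (assumed format, not the chip's exact one)."""
    max_abs = np.max(np.abs(x))
    if max_abs == 0:
        return np.zeros_like(x, dtype=np.int32), 0
    # Shared exponent chosen from the largest magnitude in the token.
    shared_exp = int(np.floor(np.log2(max_abs)))
    # Scale so the largest element fits the signed mantissa range.
    scale = 2.0 ** (shared_exp - (mantissa_bits - 2))
    mant = np.clip(np.round(x / scale),
                   -(2 ** (mantissa_bits - 1)),
                   2 ** (mantissa_bits - 1) - 1).astype(np.int32)
    return mant, shared_exp - (mantissa_bits - 2)

def bfp_dot(mant_a, exp_a, mant_b, exp_b):
    """Dot product of two block-FP vectors: narrow integer MACs plus a single
    exponent addition, which is what keeps the per-MAC energy low."""
    return int(np.dot(mant_a, mant_b)) * 2.0 ** (exp_a + exp_b)

# Example: compare the BFP dot product against the full-precision one.
rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)
mq, eq = to_block_fp(q)
mk, ek = to_block_fp(k)
print(bfp_dot(mq, eq, mk, ek), float(np.dot(q, k)))
```

The saving comes from replacing per-element floating-point multiplies with narrow integer multiplies and amortizing the exponent handling over the whole token, which is consistent with the 39-to-60% per-MAC power reduction reported in the abstract.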
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92338
DOI: 10.6342/NTU202400133
Full-Text Access: Not authorized
Appears in Collections: Graduate Institute of Electronics Engineering
Files in This Item:

File | Size | Format
---|---|---
ntu-112-1.pdf (currently not authorized for public access) | 9.5 MB | Adobe PDF
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.