Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/78852
Title: | A Fully-Programmable Deep Learning Processor with Adaptable Intelligence (具適應智能之可程式化深度學習處理器) |
Author: | Cheng-Hsun Lu (呂丞勛) |
Advisor: | Chia-Hsiang Yang (楊家驤) |
Keywords: | Deep learning, convolutional neural network, active learning, neural network adaptation, on-chip training, CMOS digital integrated circuits |
Year of Publication: | 2019 |
Degree: | Master's |
Abstract: | Deep learning has been widely deployed in many areas and demonstrates beyond-human performance in some applications. To meet the required computing power, many dedicated accelerators for deep neural network inference have been proposed. Active learning significantly improves security and privacy protection, especially for healthcare and ID verification, and adapting the network to user-specific features can further improve classification accuracy. These capabilities rely on on-chip training, but existing solutions offer only limited support for it. Since the computational complexity of training is much higher than that of inference, designing an energy-efficient processor that supports both inference and training is very challenging. This work presents the first reported dedicated convolutional neural network processor that supports both inference and training, for networks of arbitrary dimensions and variable precision. Re-arranging the convolution dataflow for both inference and training, together with a formulation that maps convolutional and fully-connected layers onto the same operation, significantly improves performance. Co-designing the max-pooling and ReLU modules reduces the memory requirement by nearly 75%. A simplified softmax function reduces the hardware area by 78%. Integrating fixed-point and floating-point operators reduces the area of the multipliers and adders by 56.8% and 17.3%, respectively. Merging a multiplier and an adder into a unified MAC unit further reduces area by 33%. In the low-precision mode, clock gating and data gating reduce the power consumption by 62%. Fabricated in 40-nm CMOS technology, the proposed processor achieves an energy efficiency of 1.25 TOPS/W in inference, competitive with state-of-the-art inference designs, and 327 GOPS/W in training, which is 105× higher than a high-end CPU. |
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/78852 |
DOI: | 10.6342/NTU201900136 |
Full-Text Authorization: | Not authorized |
Electronic Full-Text Release Date: | 2024-01-24 |
Appears in Collections: | Graduate Institute of Electronics Engineering |
Files in This Item:
File | Size | Format
---|---|---
ntu-107-1.pdf (currently not authorized for public access) | 3.73 MB | Adobe PDF
All items in the repository are protected by copyright, with all rights reserved, unless otherwise stated.
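The abstract states that convolutional and fully-connected layers are converted into the same operation, but the record does not spell out how. A common way to achieve this unification is an im2col-style data re-arrangement, where each receptive field is unrolled into a column so that convolution becomes the same matrix multiply a fully-connected layer performs. The sketch below is only an illustration of that general technique (assuming stride 1 and no padding; all function names are hypothetical), not the thesis's actual hardware dataflow:

```python
import numpy as np

def im2col(x, kh, kw):
    """Unroll a (C, H, W) input into a matrix whose columns are the
    flattened receptive fields of each output position (stride 1, no padding)."""
    c, h, w = x.shape
    oh, ow = h - kh + 1, w - kw + 1
    cols = np.empty((c * kh * kw, oh * ow))
    for i in range(oh):
        for j in range(ow):
            cols[:, i * ow + j] = x[:, i:i + kh, j:j + kw].ravel()
    return cols

def conv_as_matmul(x, weights):
    """Convolution expressed as one matrix multiply: each row of the
    weight matrix is a flattened filter, exactly like an FC layer's weights."""
    n_filters, c, kh, kw = weights.shape
    w_mat = weights.reshape(n_filters, c * kh * kw)  # FC-style weight matrix
    cols = im2col(x, kh, kw)
    oh, ow = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    return (w_mat @ cols).reshape(n_filters, oh, ow)

def conv_direct(x, weights):
    """Reference direct convolution, for checking the reformulation."""
    n_filters, c, kh, kw = weights.shape
    oh, ow = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = np.zeros((n_filters, oh, ow))
    for f in range(n_filters):
        for i in range(oh):
            for j in range(ow):
                out[f, i, j] = np.sum(x[:, i:i + kh, j:j + kw] * weights[f])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))      # 3-channel 8x8 input
w = rng.standard_normal((4, 3, 3, 3))   # four 3x3x3 filters
assert np.allclose(conv_as_matmul(x, w), conv_direct(x, w))
```

Once both layer types reduce to a matrix multiply, a single MAC array can serve convolutional and fully-connected layers alike, which is the kind of sharing a unified processing datapath exploits.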