Please use this identifier to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/102229
| Title: | 適用於先進深度神經網路之高能效硬體加速器設計 (Energy-Efficient Hardware Accelerator Design for Advanced Neural Networks) |
| Authors: | 林明廣 Ming-Guang Lin |
| Advisor: | 吳安宇 An-Yeu Wu |
| Keyword: | Deep neural networks, Energy-efficient, Hardware accelerator, Sparsity, Computing-in-memory |
| Publication Year : | 2026 |
| Degree: | Doctoral |
| Abstract: | The rapid advancement of artificial intelligence (AI) has been driven by mutually reinforcing innovations in neural network models and hardware architectures. Early convolutional neural networks (CNNs) demonstrated the power of localized feature extraction via vanilla convolutions for image recognition tasks. Subsequent innovations introduced depthwise and pointwise convolutions, enhancing computational efficiency and model performance. More recently, the emergence of Transformer architectures has revolutionized the field by enabling models to capture global dependencies within the input data, leading to significant improvements in natural language processing (NLP) and computer vision (CV) tasks.
As deep learning models become increasingly complex, there is a growing demand for energy-efficient hardware solutions, particularly as computation shifts from cloud-based platforms to intelligent edge devices. General-purpose processors struggle with the computational and memory demands of large-scale convolutional and Transformer-based networks. Therefore, dedicated accelerators for the various convolutions in CNNs and for multi-head attention (MHA) in Transformers are required to improve throughput and energy efficiency. Additionally, data movement between the processor and memory consumes a large amount of energy, so memory bandwidth remains a critical performance bottleneck. Emerging digital compute-in-memory (DCIM) architectures alleviate this by performing computations within the memory array, reducing energy-hungry data transfers. Despite the potential of DCIM to enhance performance and efficiency, its adder tree circuits dominate power and area, which limits its practicality in resource-constrained edge environments.
To overcome these difficulties, this dissertation presents three key contributions aimed at designing energy-efficient hardware accelerators for advanced neural networks. First, we propose a dedicated accelerator for the CNN-based EfficientDet model. The accelerator incorporates a reconfigurable processing element (PE) array with adaptive data mapping to optimize hardware utilization, and a tri-mode activation compression engine to minimize external memory access. Second, we design an energy-efficient accelerator for versatile Transformer-based models through sparsity-aware algorithm-hardware co-design. At the algorithm level, we introduce an alternating sparsity co-optimization (ASCO) framework to systematically optimize both dynamic and static sparsity in the model. At the hardware level, the accelerator features a Partial Product-based Attention Speculation Unit (PASU) to skip unnecessary computations, an Efficient Softmax Engine (ESE) to optimize the Softmax computation, and a Sparsity-Adaptive Computation Engine (SACE) to handle sparse data efficiently. Lastly, we improve DCIM-based architectures by replacing exact adder trees with OR-based approximate adder trees. This is complemented by the proposed sparsity-aware encoding (SAE) scheme for input activations and an in-memory swapping (IMS) mechanism, which together mitigate approximation errors without retraining.
In summary, this dissertation develops innovative hardware accelerators and architectural optimizations for the practical, efficient deployment of advanced neural networks at the edge. |
| URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/102229 |
| DOI: | 10.6342/NTU202600812 |
| Fulltext Rights: | Not authorized |
| Embargo Lift Date: | N/A |
| Appears in Collections: | Graduate Institute of Electronics Engineering |
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-114-2.pdf (Restricted Access) | 3.85 MB | Adobe PDF | |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
