Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93275

| Title: | 適用於記憶體內運算之神經網路演算法與架構共同設計 Computing-in-Memory-based Neural Network Algorithm and Architecture Co-Design |
| Author: | 張承洋 Cheng-Yang Chang |
| Advisor: | 吳安宇 An-Yeu Wu |
| Keywords: | Deep neural network, Computing-in-memory, Model compression, Approximate computing, Dynamic inference |
| Publication Year: | 2024 |
| Degree: | Doctoral |
| Abstract: | With the rapid development of the information and electronics industry, the amount of data generated globally has been rising exponentially. The International Data Corporation (IDC) predicts that by 2025, annual global data production will reach 175 zettabytes, necessitating new data processing frameworks. Concurrently, deep learning (DL) has rapidly emerged, surpassing traditional algorithms in fields such as computer vision and natural language processing and becoming part of everyday life. Many applications are gradually shifting from cloud-only solutions to intelligent edge devices, moving from computation-centric to data-centric processing. However, with the rapid advancement of DL, existing edge devices face a memory-transfer bottleneck: even though processors compute far faster than memory can be read or written, overall processing speed is limited by memory bandwidth and cannot sustain such complex algorithms and large models.
Computing-in-Memory (CIM), a non-von Neumann architecture, embeds computational logic within memory units to alleviate the data-transfer bottleneck while offering low power consumption and high density. Combining CIM with deep neural networks (DNNs) has recently achieved outstanding inference energy efficiency. However, because the complexity of DL applications and the size of DNN models grow much faster than the performance gains delivered by CIM, researchers are increasingly focusing on optimizing model architectures and developing hardware-friendly algorithms for CIM technology.
This dissertation aims to exploit the sparsity of convolutional neural network (CNN) models to enhance the energy efficiency of CIM through algorithm-architecture co-optimization. Although CNN inference benefits from sparsity, the sparsity granularity should be adapted to the computing scheme of CIM, and the sparsity level of each layer should be set systematically to maximize energy efficiency. Previous work focused on reducing model size to improve efficiency, overlooking the energy-consumption distribution of the CIM architecture, and optimized the two standard compression techniques, pruning and quantization, separately, so the resulting inference energy remained higher than expected. Meanwhile, the analog-to-digital converters (ADCs) in CIM systems account for a significant share of energy consumption. Although previous studies have explored low-precision ADCs or sparsity-detection mechanisms to avoid wasting ADC resources, these methods rely on adjusting model weights with calibration data, incurring additional pre-deployment cost.
To overcome these challenges, this dissertation proposes an energy-aware model compression technique for CIM: the sparsity level is decided by the energy reduction that compression brings, so energy-intensive weight groups are compressed preferentially (a sketch of this idea follows the record below). In addition, we introduce trainable parameters for bit-level compression, treating pruning and quantization as options of mixed-precision quantization and conducting a differentiable joint search during compression to balance accuracy and energy consumption (see the second sketch below). Furthermore, based on approximate computing, we propose a Range-aware Rounding technique for run-time bit-width adjustment that avoids pre-deployment weight tuning; it can be integrated into the CIM architecture with dynamic Block-Floating-Point (BFP) arithmetic, reducing accesses to the power-hungry ADCs (see the third sketch below), and can be combined with dynamic inference mechanisms that exploit input-specific redundancy to further improve energy efficiency and throughput. Finally, this dissertation presents an architectural design for CNN inference that integrates analog CIM macros and digital modules in a TSMC 28 nm process; the algorithms above are implemented in silicon to validate that the proposed CIM engine provides competitive energy efficiency. |
| URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93275 |
| DOI: | 10.6342/NTU202401980 |
| Full-Text Authorization: | Authorized (campus access only) |
| Appears in Collections: | Graduate Institute of Electronics Engineering |
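
The energy-aware compression described in the abstract decides how much to prune from the energy reduction each weight group would yield. The following is a minimal, illustrative Python sketch of that idea under simplified assumptions: the per-group energy estimates, the L1 saliency, and the greedy budget loop are stand-ins for exposition, not the dissertation's actual procedure.

```python
import numpy as np

def energy_aware_group_pruning(weight_groups, group_energy, energy_budget):
    """Greedily pick weight groups to prune, preferring groups whose removal
    saves the most estimated CIM energy per unit of (approximate) importance."""
    importance = [np.abs(g).sum() for g in weight_groups]        # simple L1 saliency
    order = sorted(range(len(weight_groups)),                    # best "energy saved per importance lost" first
                   key=lambda i: group_energy[i] / (importance[i] + 1e-12),
                   reverse=True)
    total_energy = sum(group_energy)
    pruned = set()
    for i in order:
        if total_energy <= energy_budget:                        # stop once the energy target is met
            break
        pruned.add(i)                                            # this group is zeroed on the CIM array
        total_energy -= group_energy[i]
    return pruned, total_energy

# toy usage: three weight groups mapped to CIM columns, energies from a hypothetical ADC/MAC model
groups = [np.random.randn(64, 9) for _ in range(3)]
energies = [3.0, 1.5, 0.5]
pruned_ids, remaining_energy = energy_aware_group_pruning(groups, energies, energy_budget=2.0)
```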
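For the joint pruning/quantization search, the abstract treats pruning as the zero-bit option of mixed-precision quantization and learns the choice with trainable parameters. The second sketch, in PyTorch, shows one common way such a differentiable bit-width search can be relaxed with a softmax over candidates; the class name, candidate bit-widths, and fake-quantization rule are assumptions for illustration, not the dissertation's exact method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferentiableBitSelect(nn.Module):
    """Softmax-relaxed choice among candidate bit-widths for one weight group.
    A width of 0 bits means the whole group is pruned, so pruning and
    quantization are searched jointly by ordinary gradient descent."""

    def __init__(self, candidate_bits=(0, 2, 4, 8)):
        super().__init__()
        self.candidate_bits = candidate_bits
        self.logits = nn.Parameter(torch.zeros(len(candidate_bits)))  # trainable selection parameters

    @staticmethod
    def fake_quant(w, bits):
        if bits == 0:
            return torch.zeros_like(w)                 # 0 bits == pruned group
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max() / qmax + 1e-12
        return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

    def forward(self, w, temperature=1.0):
        probs = F.softmax(self.logits / temperature, dim=0)
        # expected quantized weight: a differentiable mixture over all candidates
        return sum(p * self.fake_quant(w, b) for p, b in zip(probs, self.candidate_bits))

    def expected_bits(self):
        probs = F.softmax(self.logits, dim=0)
        # can be scaled by a per-group energy estimate and added to the training loss
        return sum(p * b for p, b in zip(probs, self.candidate_bits))
```

In such a search, the task loss plus an energy penalty built from `expected_bits()` pushes the logits toward low-energy bit-width choices; after training, the arg-max candidate per group would be kept for deployment.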
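Range-aware Rounding adjusts the activation bit-width at run time so that fewer bit-serial CIM cycles, and hence fewer ADC conversions, are needed. The third sketch illustrates the general flavor with a block-floating-point quantizer whose mantissa width is chosen from the block's value range; the width rule here is hypothetical and not the dissertation's exact formulation.

```python
import numpy as np

def range_aware_bfp(block, max_mantissa_bits=8, min_mantissa_bits=2):
    """Quantize one activation block to block floating point (one shared exponent),
    choosing the mantissa width at run time from the block's value range."""
    abs_max = float(np.abs(block).max())
    if abs_max == 0.0:
        return np.zeros_like(block), 0, 0              # all-zero block: skip the CIM/ADC work entirely
    shared_exp = int(np.ceil(np.log2(abs_max)))        # shared exponent for the whole block
    # a block whose values are tightly clustered needs fewer mantissa bits,
    # i.e. fewer bit-serial CIM cycles and ADC conversions
    spread = abs_max / (float(np.abs(block).mean()) + 1e-12)
    bits = int(np.clip(np.ceil(np.log2(spread)) + min_mantissa_bits,
                       min_mantissa_bits, max_mantissa_bits))
    scale = 2.0 ** (shared_exp - (bits - 1))
    mantissa = np.clip(np.round(block / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return mantissa * scale, shared_exp, bits
```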
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-112-2.pdf (access restricted to NTU campus IPs; off-campus users please use the VPN service) | 7.32 MB | Adobe PDF | |
Unless their copyright terms are otherwise indicated, all items in this repository are protected by copyright, with all rights reserved.
