NTU Theses and Dissertations Repository
Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/102198
Title: 以邏輯閘為運算基礎之卷積式神經網路電路設計與實作
Design and Implementation of a Logic-Gate-Based Convolutional Neural Network Circuit
Authors: 莊詠翔
Yung-Hsiang Chuang
Advisor: 闕志達
Tzi-Dar Chiueh
Keyword: 可微分邏輯閘,卷積式神經網路,現場可程式化邏輯閘陣列
Differentiable Logic Gate,Convolutional Neural Network,Field-Programmable Gate Array
Publication Year: 2026
Degree: Master's (碩士)
Abstract:
This thesis investigates neural network architectures based on logic-gate operations and proposes a hardware–software co-design framework with an acceleration scheme for Convolutional Differentiable Logic Gate Networks (CDLGNs). Unlike conventional Convolutional Neural Networks (CNNs), which rely heavily on multiply-and-accumulate (MAC) operations, Logic Gate Networks (LGNs) are constructed entirely from binary logic gates. As a result, floating-point computations are eliminated during inference, making LGNs inherently hardware-friendly. However, as model complexity increases, directly mapping large-scale logic networks onto hardware circuits leads to considerable resource consumption and scalability challenges.
At the algorithmic level, this work adopts Differentiable Logic to relax inherently non-differentiable Boolean operations into continuous formulations, enabling end-to-end optimization via gradient descent. After training, the network is discretized into a pure logic-gate structure for efficient inference. For image classification tasks, convolutional logic layers and randomly connected logic layers are designed to capture spatial and channel-wise features. To further reduce connection complexity and improve hardware efficiency, group convolution and channel-constrained input selection strategies are introduced, achieving a favorable balance between accuracy and implementation cost. Additionally, multi-threshold binarization and edge-detection-based feature augmentation are incorporated to enhance input representations. A multi-head annealing knowledge distillation scheme is also proposed to improve training stability and final classification performance.
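The relaxation described above can be illustrated with a minimal sketch. This is not the thesis code: the gate subset, the logit values, and the mixture form are illustrative assumptions, following the general differentiable-logic idea of replacing a hard gate choice with a softmax-weighted mixture of continuous gate relaxations.

```python
import math

def soft_gates(a, b):
    """Probabilistic relaxations of four example two-input gates.
    Inputs a, b in [0, 1] act as probabilities that a bit is 1;
    at binary inputs each formula reduces to the exact gate."""
    return {
        "AND":  a * b,
        "OR":   a + b - a * b,
        "XOR":  a + b - 2 * a * b,
        "NAND": 1 - a * b,
    }

def softmax(logits):
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    s = sum(e)
    return [x / s for x in e]

# Each neuron keeps trainable logits over the gate set (hypothetical
# learned values here); its output is a softmax-weighted mixture, which
# is differentiable in both the inputs and the gate-selection logits.
logits = [0.1, 2.0, -1.0, 0.3]
w = softmax(logits)
g = soft_gates(0.9, 0.2)
y = sum(wi * gi for wi, gi in zip(w, g.values()))

# After training, each neuron is discretized to its argmax gate,
# leaving a pure logic circuit for inference ("OR" wins here).
hard = max(zip(logits, g), key=lambda t: t[0])[1]
```

At binary inputs the relaxations agree with the Boolean gates exactly, which is why the discretized network computes a genuine logic function.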
On the software side, the incompatibility between non-MAC logic operations and existing GPU architectures is addressed by implementing dedicated CUDA kernels for both forward and backward propagation. These customized kernels significantly accelerate differentiable logic computations, achieving up to 16× speedup compared to standard PyTorch implementations. The proposed method attains a maximum classification accuracy of 83.5% on CIFAR-10, 99.25% on MNIST and 90.62% on OrganAMNIST, demonstrating both efficiency and effectiveness.
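One reason discretized logic networks map poorly onto MAC-oriented hardware but very well onto bitwise kernels is that inference reduces to word-level Boolean operations. The sketch below is illustrative only (the layer wiring and constants are made up, and the thesis's CUDA kernels are not reproduced): it packs 64 samples into one machine word so that each gate evaluation processes all 64 at once, with no multiplies at all.

```python
MASK = (1 << 64) - 1  # keep Python ints at 64-bit width

def eval_gate(kind, a, b):
    """Evaluate one gate over 64 packed samples (one bit per sample)."""
    if kind == "AND":
        return a & b
    if kind == "OR":
        return a | b
    if kind == "XOR":
        return a ^ b
    if kind == "NAND":
        return ~(a & b) & MASK
    raise ValueError(kind)

# A tiny discretized layer: each neuron is (gate, input_i, input_j).
layer = [("AND", 0, 1), ("XOR", 1, 2), ("NAND", 0, 2)]

def forward(layer, inputs):
    return [eval_gate(k, inputs[i], inputs[j]) for k, i, j in layer]

# Three packed input wires, 64 samples each (arbitrary test patterns).
wires = [0xF0F0F0F0F0F0F0F0, 0xCCCCCCCCCCCCCCCC, 0xAAAAAAAAAAAAAAAA]
outs = forward(layer, wires)
```

A GPU kernel doing the same thing per thread keeps the arithmetic units idle and the integer/bitwise pipelines busy, which is why dedicated kernels outperform expressing these ops through generic tensor primitives.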
On the hardware side, reconfigurable logic-gate operators are realized using lookup table (LUT)-based designs, and a group-parallel convolutional logic accelerator is developed and deployed on an FPGA-SoC platform. The system integrates the Processing System (PS) and Programmable Logic (PL), and is combined with the PyTorch and PYNQ frameworks to enable standalone inference without requiring an external host computer. Experimental results show that, when executing an approximately 92.7 KB LGN model, the proposed system achieves up to 2016 FPS with an energy efficiency of 1.78 mJ per frame. The average power consumption is only about 0.86% of that of an NVIDIA H100 GPU and 0.09% of that of the KV260 PS APU, highlighting its superior energy efficiency.
Overall, the proposed architecture maintains competitive classification accuracy while significantly reducing hardware complexity and improving inference throughput, thereby validating the feasibility and potential of Logic Gate Networks for high-efficiency hardware inference platforms.
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/102198
DOI: 10.6342/NTU202600875
Fulltext Rights: Authorized (open access worldwide)
Embargo Lift Date: 2026-04-09
Appears in Collections:積體電路設計與自動化學位學程

Files in This Item:
File: ntu-114-2.pdf (6.13 MB, Adobe PDF)

