Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101314

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 吳安宇 | zh_TW |
| dc.contributor.advisor | An-Yeu Wu | en |
| dc.contributor.author | 陳維隆 | zh_TW |
| dc.contributor.author | Wei-Lung Chen | en |
| dc.date.accessioned | 2026-01-14T16:11:56Z | - |
| dc.date.available | 2026-01-15 | - |
| dc.date.copyright | 2026-01-14 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2026-01-03 | - |
| dc.identifier.citation | [1] A. Vaswani et al., “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 30.
[2] Y. Huaiping et al., “Hybrid Conv-ViT network for hyperspectral image classification,” in IEEE Geoscience and Remote Sensing Letters, 2023, pp. 1-5.
[3] Z. Zixiao et al., “ViT-YOLO: Transformer-based YOLO for object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2799-2808.
[4] B. Josh et al., “Toward transformer-based object detection,” in arXiv preprint arXiv:2012.09958, 2020.
[5] S. Robin et al., “Segmenter: Transformer for semantic segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 7262-7272.
[6] S. Alex et al., “Fundamentals of recurrent neural network and long short-term memory network,” in Physica D: Nonlinear Phenomena, 2020, pp. 404.
[7] Guo and Tianmei et al., “Simple convolutional neural network on image classification,” in International Conference on Big Data Analysis (ICBDA), 2017, pp. 721-724.
[8] Ma and Yufei et al., “ALAMO: FPGA acceleration of deep learning algorithms with a modularized RTL compiler,” in Integration, 2018, pp. 14-23.
[9] N. Dat, Park and Hyun-Cheol et al., “Edge Intelligence: A Review of Deep Neural Network Inference in Resource-Limited Environments,” in Electronics, 2025, pp. 2495.
[10] Bjerge, Kim, Schougaard and J. Horsted, “A scalable and efficient convolutional neural network accelerator using HLS for a system-on-chip design,” in Microprocessors and Microsystems, 2021, pp. 87.
[11] Noronha, Daniel H., Salehpour, Bahar, Wilton, and Steven, “LeFlow: Enabling flexible FPGA high-level synthesis of TensorFlow deep neural networks,” in FSP Workshop 2018; Fifth International Workshop on FPGAs for Software Programmers, 2018, pp. 1-8.
[12] S. Arish and Sharman et al., “Run-time-reconfigurable multi-precision floating-point matrix multiplier intellectual property core on FPGA,” in Circuits, Systems, and Signal Processing, 2017, pp. 998-1026.
[13] S. E. Chang et al., “Mix and match: A novel FPGA-centric deep neural network quantization framework,” in 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2021, pp. 208-220.
[14] Z. Chen et al., “A Reconfigurable FPGA Overlay Architecture for Matrix-Matrix Multiplication,” 2022.
[15] M. Berrazuta, David, and B. Navas et al., “AHA: Design and Evaluation of Compute Intensive Hardware Accelerators for AMD-Xilinx Zynq SoCs Using HLS IP Flow,” in Computers, 2025, pp. 189. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101314 | - |
| dc.description.abstract | Vision Transformer (ViT) 由於具備強大的全域特徵擷取與長距依存關係建模能力,在影像辨識任務中展現出卓越的效能。然而,其龐大的運算量遠高於傳統卷積神經網路,特別是在線性層 (linear layer) 中的大量矩陣乘法,使其在計算與記憶體資源受限的邊緣裝置上難以高效率部署。
為解決上述問題,本研究採用高階合成 (High-Level Synthesis, HLS) 技術於可程式化邏輯閘陣列 (Field-Programmable Gate Array, FPGA) 平台上實現矩陣乘法加速。主要挑戰在於同時兼顧運算效率、記憶體頻寬與硬體資源配置的平衡。為此,本研究提出一個具層次化結構的優化方法,分為三個層級:(1) 運算層級 (Computing-based) 採用區塊矩陣乘法 (Block Matrix Multiplication, BMM) 結構以提升資料重用並降低 DRAM 存取延遲;(2) 編譯指令層級 (Pragma-based) 透過 loop pipelining、loop unrolling 與 array reshaping 技術以提升平行運算效率;(3) 系統層級 (System-level) 結合 AXI-Stream 通訊協定與 Direct Memory Access (DMA) 機制以減少 PS 與 PL 間的資料傳輸開銷。
在 Xilinx ZCU104 平台上的實作結果顯示,本研究提出之多層次優化框架能有效提升運算效率並維持硬體資源的使用平衡。透過 PYNQ 平台的軟硬體整合部署,完整驗證了本研究架構的可行性與加速效果,證明階層式的 HLS 優化設計能顯著提升 Vision Transformer 模型於 FPGA 邊緣運算環境中的效能與能源效率。 | zh_TW |
| dc.description.abstract | The vision Transformer (ViT) has demonstrated outstanding performance in visual recognition tasks due to its ability to capture global contextual relationships and long-range dependencies. However, its computational complexity is significantly higher than that of conventional convolutional neural networks, particularly in the fully connected linear layers. The extensive matrix multiplications within the Query–Key–Value (QKV) projections make it difficult to deploy ViT models efficiently on edge devices with limited computing and memory resources.
To address this challenge, this study employs High-Level Synthesis (HLS) to accelerate matrix multiplication on FPGAs. The main design challenge lies in balancing hardware resource utilization, memory bandwidth, and computation latency. To overcome these issues, a hierarchical optimization framework is proposed, consisting of three levels: (1) computing-based optimization using Block Matrix Multiplication to enhance data reuse and reduce DRAM access; (2) pragma-based optimization leveraging loop pipelining, unrolling, and array reshaping to increase parallelism and throughput; and (3) system-level optimization integrating AXI-Stream and Direct Memory Access (DMA) to minimize data transfer overhead between the processing system and programmable logic. The proposed framework was implemented on a Xilinx ZCU104 platform and deployed through the PYNQ environment. Experimental results confirm that the multilevel optimization approach effectively enhances computational efficiency and resource utilization, providing a practical and scalable solution for accelerating vision Transformer models on FPGA-based edge devices. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2026-01-14T16:11:56Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2026-01-14T16:11:56Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | CONTENTS
Acknowledgements
Abstract (Chinese)
ABSTRACT
CONTENTS
LIST OF FIGURES
LIST OF TABLES
Chapter 1 Introduction
1.1 Background
1.1.1 Transformer
1.1.2 Application of Vision Transformer
1.1.3 Field-Programmable Gate Arrays
1.1.4 High-Level Synthesis
1.2 Motivation and Main Contributions
1.3 Thesis Organization
Chapter 2 Related Works
2.1 High-Level Synthesis Transformation
2.2 IP Core Generation
2.3 PYNQ Deployment
2.4 Related Works
2.5 Summary
Chapter 3 Computing Optimization
3.1 Problem of High DRAM Transfer Latency
3.2 Block Matrix Multiplication
3.3 Loop Ordering Decisions
3.3.1 Impact on Data Reuse
3.3.2 Alignment with DRAM Access Patterns
3.4 Experimental Setup
3.5 Experimental Results
3.6 Summary
Chapter 4 Pragma Optimization
4.1 Pipeline
4.2 Unroll
4.3 Array Partition
4.4 Wider-Bus Data Packing
4.5 Experiment Results
4.6 Summary
Chapter 5 System-Level Optimization
5.1 AXI Interface Overview and Comparison
5.2 AXI4-Stream and DMA Integration
5.3 System-Level Design Flow with AXI-Stream
5.4 Hardware System Integration in Vivado
5.5 Experimental Results
5.5.1 Hardware Simulation Results
5.5.2 End-to-End PYNQ Deployment
5.6 Summary
Chapter 6 Contributions and Future Works
6.1 Overall Discussion and Conclusion
6.2 Future Work
References | - |
| dc.language.iso | en | - |
| dc.subject | 高階合成 | - |
| dc.subject | FPGA 加速 | - |
| dc.subject | Vision Transformer | - |
| dc.subject | PYNQ 部署 | - |
| dc.subject | High-Level Synthesis (HLS) | - |
| dc.subject | FPGA Acceleration | - |
| dc.subject | Vision Transformer (ViT) | - |
| dc.subject | PYNQ Deployment | - |
| dc.title | 基於高階合成之矩陣乘法硬體加速及其在視覺轉換器之應用 | zh_TW |
| dc.title | High-Level Synthesis-Based Hardware Acceleration of Matrix Multiplication and Application in Vision Transformers | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 114-1 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.oralexamcommittee | 盧奕璋;陳奕達 | zh_TW |
| dc.contributor.oralexamcommittee | Yi-Chang Lu;Yi-Da Chen | en |
| dc.subject.keyword | 高階合成,FPGA 加速,Vision Transformer,PYNQ 部署 | zh_TW |
| dc.subject.keyword | High-Level Synthesis (HLS),FPGA Acceleration,Vision Transformer (ViT),PYNQ Deployment | en |
| dc.relation.page | 66 | - |
| dc.identifier.doi | 10.6342/NTU202504853 | - |
| dc.rights.note | 未授權 | - |
| dc.date.accepted | 2026-01-05 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 電子工程學研究所 | - |
| dc.date.embargo-lift | N/A | - |
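
The record above contains no code, but the abstract describes a concrete design: block (tiled) matrix multiplication for data reuse at the computing level, plus loop pipelining, unrolling, and array partitioning at the pragma level. The sketch below is a minimal, hypothetical Vitis HLS C++ kernel illustrating only those two levels; the matrix size N, tile size T, the function name bmm_kernel, and the interface bundling are illustrative assumptions and are not taken from the thesis.

```cpp
// Hypothetical Vitis HLS sketch of blocked (tiled) matrix multiplication.
// Sizes, names, and interface choices are illustrative assumptions only;
// they are not taken from the thesis described in this record.

#define N 256   // assumed square matrix dimension, C = A * B, all N x N
#define T 32    // assumed tile (block) size; must divide N

typedef float data_t;

void bmm_kernel(const data_t A[N][N], const data_t B[N][N], data_t C[N][N]) {
// Map the matrices to DDR through AXI4 master ports; control via AXI-Lite.
#pragma HLS INTERFACE m_axi     port=A offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi     port=B offset=slave bundle=gmem1
#pragma HLS INTERFACE m_axi     port=C offset=slave bundle=gmem2
#pragma HLS INTERFACE s_axilite port=return

    data_t a_buf[T][T], b_buf[T][T], c_buf[T][T];
    // Partition along the reduction dimension so the unrolled inner loop
    // can read T operand pairs from on-chip memory in the same cycle.
#pragma HLS ARRAY_PARTITION variable=a_buf complete dim=2
#pragma HLS ARRAY_PARTITION variable=b_buf complete dim=1

row_blk:
    for (int bi = 0; bi < N; bi += T) {
col_blk:
        for (int bj = 0; bj < N; bj += T) {
init:       // Clear the output tile accumulator.
            for (int i = 0; i < T; i++)
                for (int j = 0; j < T; j++) {
#pragma HLS PIPELINE II=1
                    c_buf[i][j] = 0;
                }
k_blk:
            for (int bk = 0; bk < N; bk += T) {
load:           // Burst-load one tile of A and one tile of B into BRAM
                // so each DRAM word is reused T times on chip.
                for (int i = 0; i < T; i++)
                    for (int j = 0; j < T; j++) {
#pragma HLS PIPELINE II=1
                        a_buf[i][j] = A[bi + i][bk + j];
                        b_buf[i][j] = B[bk + i][bj + j];
                    }
compute:        // Pipelined tile product; the reduction loop is unrolled.
                for (int i = 0; i < T; i++)
                    for (int j = 0; j < T; j++) {
#pragma HLS PIPELINE II=1
                        data_t acc = c_buf[i][j];
                        for (int k = 0; k < T; k++) {
#pragma HLS UNROLL
                            acc += a_buf[i][k] * b_buf[k][j];
                        }
                        c_buf[i][j] = acc;
                    }
            }
store:      // Write the finished tile back to DRAM.
            for (int i = 0; i < T; i++)
                for (int j = 0; j < T; j++) {
#pragma HLS PIPELINE II=1
                    C[bi + i][bj + j] = c_buf[i][j];
                }
        }
    }
}
```

With T = 32, every element fetched from DRAM is reused 32 times out of on-chip buffers before being discarded, which is the data-reuse effect the computing-level optimization in the abstract targets; the actual tile sizes, loop ordering, and interface configuration studied in the thesis may differ.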
| Appears in Collections: | 電子工程學研究所 |
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-114-1.pdf (Restricted Access) | 2.1 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
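
For the system-level optimization named in the abstract (AXI4-Stream with DMA between the processing system and the programmable logic), the following is a minimal hedged sketch of an AXI-Stream-facing HLS top function. Asserting TLAST on the final beat is what lets a PS-side DMA engine, for example one driven from PYNQ via dma.sendchannel.transfer() and dma.recvchannel.transfer(), frame the transfer. The function name, vector length, and the trivial per-word operation are illustrative assumptions, not the thesis's actual kernel.

```cpp
// Hypothetical AXI4-Stream wrapper: receives a vector over AXI-Stream,
// applies a placeholder operation, and streams it back with TLAST set on
// the final beat so a PS-side DMA engine (e.g. driven from PYNQ) can
// frame the packet. Names and sizes are illustrative assumptions.
#include "ap_axi_sdata.h"
#include "hls_stream.h"

typedef ap_axiu<32, 0, 0, 0> axi_word;  // 32-bit data + TKEEP/TLAST sideband

#define LEN 1024                         // assumed number of 32-bit words

void axis_passthrough(hls::stream<axi_word> &in_stream,
                      hls::stream<axi_word> &out_stream) {
#pragma HLS INTERFACE axis     port=in_stream
#pragma HLS INTERFACE axis     port=out_stream
#pragma HLS INTERFACE s_axilite port=return

xfer:
    for (int i = 0; i < LEN; i++) {
#pragma HLS PIPELINE II=1
        axi_word w = in_stream.read();   // one beat from the DMA MM2S channel
        w.data = w.data + 1;             // placeholder for the real PL computation
        w.keep = -1;                     // all bytes valid
        w.last = (i == LEN - 1);         // close the packet for the S2MM channel
        out_stream.write(w);
    }
}
```

On the PYNQ side such a kernel would typically be paired with an AXI DMA IP: the MM2S channel streams an input buffer into in_stream and the S2MM channel writes out_stream back to DRAM, which is the PS–PL transfer path the abstract's system-level optimization addresses.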
