Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/100956
Full metadata record
DC Field: Value [Language]
dc.contributor.advisor: 吳安宇 [zh_TW]
dc.contributor.advisor: An-Yeu Wu [en]
dc.contributor.author: 劉冠亨 [zh_TW]
dc.contributor.author: Kuan-Heng Liu [en]
dc.date.accessioned: 2025-11-26T16:14:46Z
dc.date.available: 2025-11-27
dc.date.copyright: 2025-11-26
dc.date.issued: 2025
dc.date.submitted: 2025-10-22
dc.identifier.citation:
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[2] P. Micikevicius et al., "Mixed precision training," arXiv preprint arXiv:1710.03740, 2017.
[3] S. Shen et al., "Q-BERT: Hessian based ultra low precision quantization of BERT," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, pp. 8815-8821, 2020.
[4] J. Ainslie et al., "GQA: Training generalized multi-query transformer models from multi-head checkpoints," arXiv preprint arXiv:2305.13245, 2023.
[5] Y. Leviathan, M. Kalman, and Y. Matias, "Fast inference from transformers via speculative decoding," in Proceedings of the 40th International Conference on Machine Learning, vol. 202, pp. 19274-19286, 2023.
[6] T. Dao, "FlashAttention-2: Faster attention with better parallelism and work partitioning," arXiv preprint arXiv:2307.08691, 2023.
[7] A. Holtzman et al., "The curious case of neural text degeneration," arXiv preprint arXiv:1904.09751, 2019.
[8] ASP-DAC 2024 Tutorial-7.
[9] W. A. Wulf and S. A. McKee, "Hitting the memory wall: Implications of the obvious," ACM SIGARCH Computer Architecture News, vol. 23, no. 1, pp. 20-24, 1995.
[10] J. Kaplan et al., "Scaling Laws for Neural Language Models," arXiv preprint arXiv:2001.08361, 2020.
[11] T. Dettmers et al., "QLoRA: Efficient finetuning of quantized LLMs," arXiv preprint arXiv:2305.14314, 2023.
[12] G. Gerganov, "llama.cpp: The history and internals," GitHub Repository Documentation, 2023.
[13] G. Gerganov and contributors, "llama.cpp: High-performance inference for LLMs," GitHub Repository.
[14] H. Touvron et al., "Llama 2: Open foundation and fine-tuned chat models," arXiv preprint arXiv:2307.09288, 2023.
[15] G. Gerganov, "ggml: Tensor library for machine learning," GitHub Repository, 2023. [Online]. Available: https://github.com/ggerganov/ggml.
[16] M. Ham, "pynq_api: C API drivers for PYNQ FPGA board," GitHub Repository, 2023. [Online]. Available: https://github.com/mesham/pynq_api.
[17] J. Haris et al., "SECDA: Efficient Hardware/Software Co-Design of FPGA-based DNN Accelerators for Edge Inference," 2021 IEEE 33rd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Belo Horizonte, Brazil, 2021.
[18] H. Xu, Y. Li, and S. Ji, "LlamaF: An Efficient Llama2 Architecture Accelerator on Embedded FPGAs," 2024 IEEE 10th World Forum on Internet of Things (WF-IoT), Ottawa, ON, Canada, 2024.
[19] J. Li, T. Li et al., "Pushing up to the Limit of Memory Bandwidth and Capacity Utilization for Efficient LLM Decoding on Embedded FPGA," 2025 Design, Automation & Test in Europe Conference (DATE), Lyon, France, 2025.
[20] J. Smith et al., "LoopLynx: A Scalable Dataflow Architecture for Efficient LLM Inference on FPGAs," in Proceedings of DATE 2025, Lyon, France, 2025.
[21] K. Lee et al., "Research on Low-Latency Inference and Training Efficiency for GNN and LLM-based Systems," arXiv preprint arXiv:2507.01035, 2025.
[22] S. Kim et al., "Adaptive Resource Synchronization in FPGA-Accelerated LLM Inference Systems," in Proceedings of the IEEE International Conference on Edge Computing (EDGE), pp. 45-58, 2025.
[23] R. Bansal, "Perplexity Metric for LLM Evaluation," Analytics Vidhya Blog, Apr. 2025. [Online]. Available: https://www.analyticsvidhya.com/blog/2025/04/perplexity-metric-for-llm-evaluation.
[24] D. Wang et al., "StoreLLM: Energy Efficient Large Language Model Inference with Permanently Pre-stored Attention Matrices," in Proceedings of the 16th ACM International Conference on Future and Sustainable Energy Systems (ACM e-Energy '25), 2025.
[25] X. Xiao et al., "LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference," Apple Machine Learning Research, Jul. 2024. [Online]. Available: https://machinelearning.apple.com/research/dynamic-token-pruning.
[26] W. Kwon et al., "Learned Token Pruning for Transformers," in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '22), Washington, DC, USA, 2022, pp. 1-11.
[27] C. Chen et al., "Token Pruning in Multimodal Large Language Models," in Findings of the Association for Computational Linguistics: ACL 2025, pp. 13456-13468, 2025.
[28] S. Alvar et al., "DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14567-14576, 2025.
[29] M. Goyal et al., "Efficient Token Pruning in Sparse Transformers for Edge Devices," in Proceedings of the International Conference on Learning Representations (ICLR), pp. 1123-1138, 2025.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/100956
dc.description.abstract [zh_TW]:
大型語言模型以深層注意力機制於文本生成、知識推理和跨模態理解展現前所未有的精確度,惟龐大參數量與高記憶體頻寬需求,使其在手機、穿戴裝置、無人機等邊緣場域難以即時完成推論。為解決此瓶頸,本論文提出邊緣化大型語言模型推論系統—一套建構於現場可程式化邏輯閘陣列之硬體/軟體協同平台,利用可重組邏輯結合低位元量化模型,兼顧彈性、能效與即時性。
本系統核心包含三項創新模組:第一,矩陣乘法引擎(Matrix-Multiplication Engine)以向量點積單元串接區塊化乘加加速器,考慮量化格式、並支援多執行緒執行,全面提升匯流排頻寬利用率與計算密度;第二,DMA-計算交錯排程配合執行緒感知策略,將資料搬移與前向運算時間重疊,顯著縮短長序列解碼延遲;第三,字典導向靜態令牌剪枝,於預填充階段依詞元與錨點相似度剔除低重要度詞元,避免無效乘加並縮減快取。
三項技術透過擴增之 PYNQ-C API 緊密整合至開源框架 Llama.cpp,使用者僅需載入標準 GGUF 格式模型即可啟動硬體加速,無須修改網路結構或重新量化。整體設計在不增加記憶體資源的前提下,同步提升 prefill 與解碼階段效能,支援交互式對話、語音輔助與本機生成等邊緣情境,充分展現現場可程式化邏輯閘陣列在去中心化 LLM 部署之潛能。此平台亦能根據學界與產業需求,延伸至多模態模型、動態剪枝或自適應量化,為未來低功耗智慧終端奠定基礎。
dc.description.abstract [en]:
Large Language Models (LLMs), leveraging deep attention mechanisms, have demonstrated unprecedented accuracy in text generation, knowledge reasoning, and cross-modal understanding. However, their massive parameter counts and high memory bandwidth requirements pose significant barriers to real-time inference on edge devices such as smartphones, wearables, and drones. To overcome these bottlenecks, this thesis introduces Edge-LLM—a hardware-software co-designed inference system built on Field-Programmable Gate Arrays (FPGAs). By exploiting reconfigurable logic and low-bit quantization, the system achieves a balanced trade-off between flexibility, energy efficiency, and real-time performance.

The core of the system comprises three innovative modules. First, a Matrix-Multiplication Engine integrates vector-dot processing units with chunk-based multiply-accumulate accelerators, optimizing for quantized formats and supporting multi-threaded execution to maximize bus utilization and computational density. Second, an interleaved DMA-compute scheduling mechanism, combined with a thread-aware acceleration strategy, overlaps data transfers with forward computations, substantially reducing latency in long-sequence decoding. Third, a dictionary-guided static token pruning approach eliminates low-importance tokens during the prefill stage by measuring similarity to anchor points, thereby avoiding redundant multiply-accumulate operations and minimizing cache usage.
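The interleaved DMA-compute scheduling described above is, at its heart, a double-buffering pattern: while the accelerator consumes weight chunk k, the DMA engine stages chunk k+1. The sketch below is a minimal, hedged illustration of that pattern only; it uses std::async as a stand-in for the non-blocking DMA engine and a naive byte-wise loop as a stand-in for the chunk-based accelerator. All identifiers (dma_start_read, cmma_run, matvec_interleaved) are hypothetical placeholders and are not the thesis's actual PYNQ-C extensions.

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <future>
#include <vector>

// Simulated non-blocking DMA read: stages one weight chunk into a bounce buffer.
// (Stand-in for the real DMA driver call, which would return immediately.)
std::future<void> dma_start_read(const uint8_t* src, uint8_t* dst, size_t bytes) {
    return std::async(std::launch::async,
                      [=] { std::memcpy(dst, src, bytes); });
}

// Stand-in for one accelerator invocation: computes rows_per_chunk rows of
// y = W_chunk * x. A real engine would operate on GGUF-quantized blocks.
void cmma_run(const uint8_t* w_chunk, const float* x, float* y,
              size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; ++r) {
        float acc = 0.f;
        for (size_t c = 0; c < cols; ++c)
            acc += static_cast<float>(w_chunk[r * cols + c]) * x[c];
        y[r] = acc;
    }
}

// Double-buffered mat-vec: the transfer of chunk k+1 overlaps the compute on
// chunk k, so the per-chunk cost tends toward max(transfer, compute) rather
// than their sum.
void matvec_interleaved(const uint8_t* weights, size_t rows_per_chunk,
                        size_t cols, size_t num_chunks,
                        const float* x, float* y) {
    const size_t chunk_bytes = rows_per_chunk * cols;
    std::vector<uint8_t> bounce[2] = {std::vector<uint8_t>(chunk_bytes),
                                      std::vector<uint8_t>(chunk_bytes)};

    // Prefetch chunk 0, then overlap every later transfer with compute.
    auto pending = dma_start_read(weights, bounce[0].data(), chunk_bytes);
    for (size_t k = 0; k < num_chunks; ++k) {
        pending.wait();  // chunk k is now resident in bounce[k % 2]
        if (k + 1 < num_chunks)
            pending = dma_start_read(weights + (k + 1) * chunk_bytes,
                                     bounce[(k + 1) % 2].data(), chunk_bytes);
        cmma_run(bounce[k % 2].data(), x, y + k * rows_per_chunk,
                 rows_per_chunk, cols);
    }
}

With two staging buffers, each chunk's effective latency approaches the larger of its transfer time and its compute time, which is where the claimed reduction in long-sequence decoding latency comes from.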

These modules are seamlessly integrated into the open-source Llama.cpp framework via extended PYNQ-C APIs, allowing users to load standard GGUF-format models for hardware acceleration without modifying network architectures or requantizing. The design enhances both prefill and decoding performance without additional memory overhead, enabling applications such as interactive dialogue, voice assistance, and on-device generation in edge scenarios. This platform fully harnesses the potential of FPGAs for decentralized LLM deployment and can be extended to multimodal models, dynamic pruning, or adaptive quantization to meet evolving academic and industrial needs.
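The transparent integration described here implies a dispatch-with-fallback pattern at the matrix-multiplication call site: the runtime inspects each quantized matmul, routes sufficiently large ones to the FPGA engine, and otherwise keeps the CPU path, so the GGUF model and graph code stay untouched. The sketch below illustrates that pattern only; dispatch_matmul_q, fpga_matmul_q, cpu_matmul_q, and the size threshold are assumed names and values, not actual llama.cpp, GGML, or PYNQ-C symbols.

#include <cstddef>
#include <cstdint>

// Placeholder FPGA path: returns false when the overlay is unavailable so the
// caller can fall back transparently. A real implementation would wrap the
// extended PYNQ-C driver calls here.
static bool fpga_matmul_q(const uint8_t* /*w_quant*/, const float* /*x*/,
                          float* /*y*/, size_t /*rows*/, size_t /*cols*/) {
    return false;  // no accelerator present in this standalone sketch
}

// Reference CPU path (treats the quantized weights as plain bytes purely for
// illustration; real code would dequantize per the GGUF block format).
static void cpu_matmul_q(const uint8_t* w_quant, const float* x, float* y,
                         size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; ++r) {
        float acc = 0.f;
        for (size_t c = 0; c < cols; ++c)
            acc += static_cast<float>(w_quant[r * cols + c]) * x[c];
        y[r] = acc;
    }
}

// Dispatcher: large layers go to the FPGA when available, everything else
// stays on the CPU. The 1M-element threshold is an illustrative assumption.
void dispatch_matmul_q(const uint8_t* w_quant, const float* x, float* y,
                       size_t rows, size_t cols) {
    const size_t kOffloadThreshold = 1u << 20;
    if (rows * cols >= kOffloadThreshold &&
        fpga_matmul_q(w_quant, x, y, rows, cols))
        return;                               // accelerator handled it
    cpu_matmul_q(w_quant, x, y, rows, cols);  // transparent fallback
}

Because the dispatch sits below the model graph, nothing about the GGUF file changes, matching the abstract's claim that no architecture modification or requantization is required.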
dc.description.provenance [en]: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-11-26T16:14:46Z. No. of bitstreams: 0
dc.description.provenance [en]: Made available in DSpace on 2025-11-26T16:14:46Z (GMT). No. of bitstreams: 0
dc.description.tableofcontents:
CONTENTS
誌謝 (Acknowledgements) i
摘要 (Chinese Abstract) iii
ABSTRACT v
CONTENTS vii
LIST OF FIGURES xi
LIST OF TABLES xiii
Chapter 1 Introduction 1
1.1 Background 1
1.1.1 Large Language Models (LLMs) 1
1.1.2 Efficient Training and Inference 3
1.1.3 Text-Generation Process and Decoding 5
1.1.4 Memory-Bandwidth Gap on Edge 6
1.2 Problem Statement 8
1.3 Motivation and Main Contributions 10
1.3.1 Latency Bottlenecks in Edge LLM Inference 10
1.3.2 Opportunities for Chunk-Based MatMul Accelerator 12
1.3.3 Software–Hardware Co-Design with FPGA 13
1.3.4 Static Token Pruning for Prefill Acceleration 14
1.4 Thesis Objectives 14
1.5 Thesis Organization 16
Chapter 2 Review of Edge LLM Systems 19
2.1 Edge Constraints and Challenges 19
2.1.1 Compute/Memory/Power Limitations 19
2.1.2 Runtime Bandwidth Analysis 20
2.2 Llama.cpp Software Framework 21
2.2.1 GGUF Format & Mixed-Precision Quantization 22
2.2.2 Python-C Bindings Overhead 23
2.3 PYNQ-C Framework for FPGA Integration 23
2.3.1 Overlay Loading & Memory-Mapping 24
2.3.2 Hardware API Abstraction 25
2.4 Related FPGA-Based Edge-LLM System 26
2.4.1 SECDA-LLM [17] 26
2.4.2 LlamaF [18] 28
2.4.3 Efficient LLM Inference System [19] 29
2.5 Challenges of the Prior Works 31
Chapter 3 Matrix-Multiplication Engine (MME) 33
3.1 System Overview 33
3.2 Chunk-Based MatMul Accelerator (CMMA) 35
3.2.1 DMA-Aware Dataflow Design 36
3.2.2 Optimizing Throughput 37
3.3 Vector-dot Processing Unit (VPU) 39
3.3.1 Scalability 41
3.4 Resource Utilization 42
3.5 Summary 43
Chapter 4 Software–Hardware Joint Optimizations 47
4.1 Multi-Channel DMA 47
4.2 Interleaved DMA-Compute Scheduling 48
4.2.1 Overlap Technique 50
4.3 Thread-Aware Hardware Acceleration 51
4.3.1 Workload Partitioning 51
4.3.2 Threading Synchronization Strategies 53
4.4 Summary 54
Chapter 5 Dictionary-Guided Static Token Pruning 59
5.1 Dataset 59
5.2 Perplexity 60
5.3 Token Elimination Algorithm 63
5.3.1 Token-level Pruning 63
5.3.2 Experiment Result 66
5.3.3 Performance analysis with pruning ratio 67
5.4 Summary 68
Chapter 6 Contributions and Future Works 72
6.1 Contributions 72
6.2 Future Works 73
6.2.1 Large-Scale and Sparsity-Aware CMMA 73
6.2.2 Dynamic Runtime Token Pruning 74
Reference 76
dc.language.iso: zh_TW
dc.subject: 現場可程式化邏輯閘陣列
dc.subject: 大型語言模型
dc.subject: 邊緣推論
dc.subject: 軟硬體協同
dc.subject: 令牌剪枝
dc.subject: FPGA
dc.subject: LLM
dc.subject: Edge Inference
dc.subject: Hardware-Software Co-Design
dc.subject: Token Pruning
dc.title: 基於可程式化邏輯陣列實現高效邊緣生成式大型語言模型推論引擎 [zh_TW]
dc.title: Efficient Edge Inference Engine for Generative Large Language Models on FPGA [en]
dc.type: Thesis
dc.date.schoolyear: 114-1
dc.description.degree: 碩士
dc.contributor.oralexamcommittee: 陳坤志;沈中安;阮聖彰 [zh_TW]
dc.contributor.oralexamcommittee: Kun-Chih Chen; Chung-An Shen; Shanq-Jang Ruan [en]
dc.subject.keyword: 現場可程式化邏輯閘陣列,大型語言模型,邊緣推論,軟硬體協同,令牌剪枝 [zh_TW]
dc.subject.keyword: FPGA, LLM, Edge Inference, Hardware-Software Co-Design, Token Pruning [en]
dc.relation.page: 78
dc.identifier.doi: 10.6342/NTU202504594
dc.rights.note: 同意授權(限校園內公開)
dc.date.accepted: 2025-10-22
dc.contributor.author-college: 電機資訊學院
dc.contributor.author-dept: 電子工程學研究所
dc.date.embargo-lift: 2025-11-27
Appears in Collections: 電子工程學研究所

Files in This Item:
File | Size | Format
ntu-114-1.pdf (access restricted to NTU campus IP addresses; use the library VPN service for off-campus access) | 4.31 MB | Adobe PDF