Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/100956
Full metadata record
DC Field: Value [Language]
dc.contributor.advisor: 吳安宇 [zh_TW]
dc.contributor.advisor: An-Yeu Wu [en]
dc.contributor.author: 劉冠亨 [zh_TW]
dc.contributor.author: Kuan-Heng Liu [en]
dc.date.accessioned: 2025-11-26T16:14:46Z
dc.date.available: 2025-11-27
dc.date.copyright: 2025-11-26
dc.date.issued: 2025
dc.date.submitted: 2025-10-22
dc.identifier.citation:
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[2] P. Micikevicius et al., "Mixed precision training," arXiv preprint arXiv:1710.03740, 2017.
[3] S. Shen et al., "Q-BERT: Hessian based ultra low precision quantization of BERT," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, pp. 8815-8821, 2020.
[4] J. Ainslie et al., "GQA: Training generalized multi-query transformer models from multi-head checkpoints," arXiv preprint arXiv:2305.13245, 2023.
[5] Y. Leviathan, M. Kalman, and Y. Matias, "Fast inference from transformers via speculative decoding," in Proceedings of the 40th International Conference on Machine Learning, vol. 202, pp. 19274-19286, 2023.
[6] T. Dao, "FlashAttention-2: Faster attention with better parallelism and work partitioning," arXiv preprint arXiv:2307.08691, 2023.
[7] A. Holtzman et al., "The curious case of neural text degeneration," arXiv preprint arXiv:1904.09751, 2019.
[8] ASP-DAC 2024 Tutorial-7.
[9] W. A. Wulf and S. A. McKee, "Hitting the memory wall: Implications of the obvious," ACM SIGARCH Computer Architecture News, vol. 23, no. 1, pp. 20-24, 1995.
[10] J. Kaplan et al., "Scaling Laws for Neural Language Models," arXiv preprint arXiv:2001.08361, 2020.
[11] T. Dettmers et al., "QLoRA: Efficient finetuning of quantized LLMs," arXiv preprint arXiv:2305.14314, 2023.
[12] G. Gerganov, "llama.cpp: The history and internals," GitHub Repository Documentation, 2023.
[13] G. Gerganov and contributors, "llama.cpp: High-performance inference for LLMs," GitHub Repository.
[14] H. Touvron et al., "Llama 2: Open foundation and fine-tuned chat models," arXiv preprint arXiv:2307.09288, 2023.
[15] G. Gerganov, "ggml: Tensor library for machine learning," GitHub Repository, 2023. [Online]. Available: https://github.com/ggerganov/ggml.
[16] M. Ham, "pynq_api: C API drivers for PYNQ FPGA board," GitHub Repository, 2023. [Online]. Available: https://github.com/mesham/pynq_api.
[17] J. Haris et al., "SECDA: Efficient Hardware/Software Co-Design of FPGA-based DNN Accelerators for Edge Inference," 2021 IEEE 33rd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Belo Horizonte, Brazil, 2021.
[18] H. Xu, Y. Li, and S. Ji, "LlamaF: An Efficient Llama2 Architecture Accelerator on Embedded FPGAs," 2024 IEEE 10th World Forum on Internet of Things (WF-IoT), Ottawa, ON, Canada, 2024.
[19] J. Li, T. Li et al., "Pushing up to the Limit of Memory Bandwidth and Capacity Utilization for Efficient LLM Decoding on Embedded FPGA," 2025 Design, Automation & Test in Europe Conference (DATE), Lyon, France, 2025.
[20] J. Smith et al., "LoopLynx: A Scalable Dataflow Architecture for Efficient LLM Inference on FPGAs," in Proceedings of DATE 2025, Lyon, France, 2025.
[21] K. Lee et al., "Research on Low-Latency Inference and Training Efficiency for GNN and LLM-based Systems," arXiv preprint arXiv:2507.01035, 2025.
[22] S. Kim et al., "Adaptive Resource Synchronization in FPGA-Accelerated LLM Inference Systems," in Proceedings of the IEEE International Conference on Edge Computing (EDGE), pp. 45-58, 2025.
[23] R. Bansal, "Perplexity Metric for LLM Evaluation," Analytics Vidhya Blog, Apr. 2025. [Online]. Available: https://www.analyticsvidhya.com/blog/2025/04/perplexity-metric-for-llm-evaluation.
[24] D. Wang et al., "StoreLLM: Energy Efficient Large Language Model Inference with Permanently Pre-stored Attention Matrices," in Proceedings of the 16th ACM International Conference on Future and Sustainable Energy Systems (ACM e-Energy '25), 2025.
[25] X. Xiao et al., "LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference," Apple Machine Learning Research, Jul. 2024. [Online]. Available: https://machinelearning.apple.com/research/dynamic-token-pruning.
[26] W. Kwon et al., "Learned Token Pruning for Transformers," in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '22), Washington, DC, USA, 2022, pp. 1-11.
[27] C. Chen et al., "Token Pruning in Multimodal Large Language Models," in Findings of the Association for Computational Linguistics: ACL 2025, pp. 13456-13468, 2025.
[28] S. Alvar et al., "DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14567-14576, 2025.
[29] M. Goyal et al., "Efficient Token Pruning in Sparse Transformers for Edge Devices," in Proceedings of the International Conference on Learning Representations (ICLR), pp. 1123-1138, 2025.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/100956
dc.description.abstract [zh_TW]:
大型語言模型以深層注意力機制於文本生成、知識推理和跨模態理解展現前所未有的精確度,惟龐大參數量與高記憶體頻寬需求,使其在手機、穿戴裝置、無人機等邊緣場域難以即時完成推論。為解決此瓶頸,本論文提出邊緣化大型語言模型推論系統—一套建構於現場可程式化邏輯閘陣列之硬體/軟體協同平台,利用可重組邏輯結合低位元量化模型,兼顧彈性、能效與即時性。
本系統核心包含三項創新模組:第一,矩陣乘法引擎(Matrix-Multiplication Engine)以向量點積單元串接區塊化乘加加速器,考慮量化格式、並支援多執行緒執行,全面提升匯流排頻寬利用率與計算密度;第二,DMA-計算交錯排程配合執行緒感知策略,將資料搬移與前向運算時間重疊,顯著縮短長序列解碼延遲;第三,字典導向靜態令牌剪枝,於預填充階段依詞元與錨點相似度剔除低重要度詞元,避免無效乘加並縮減快取。
三項技術透過擴增之 PYNQ-C API 緊密整合至開源框架 Llama.cpp,使用者僅需載入標準 GGUF 格式模型即可啟動硬體加速,無須修改網路結構或重新量化。整體設計在不增加記憶體資源的前提下,同步提升 prefill 與解碼階段效能,支援交互式對話、語音輔助與本機生成等邊緣情境,充分展現現場可程式化邏輯閘陣列在去中心化 LLM 部署之潛能。此平台亦能根據學界與產業需求,延伸至多模態模型、動態剪枝或自適應量化,為未來低功耗智慧終端奠定基礎。
dc.description.abstract [en]:
Large Language Models (LLMs), leveraging deep attention mechanisms, have demonstrated unprecedented accuracy in text generation, knowledge reasoning, and cross-modal understanding. However, their massive parameter counts and high memory bandwidth requirements pose significant barriers to real-time inference on edge devices such as smartphones, wearables, and drones. To overcome these bottlenecks, this thesis introduces Edge-LLM—a hardware-software co-designed inference system built on Field-Programmable Gate Arrays (FPGAs). By exploiting reconfigurable logic and low-bit quantization, the system achieves a balanced trade-off between flexibility, energy efficiency, and real-time performance.

The core of the system comprises three innovative modules. First, a Matrix-Multiplication Engine integrates vector-dot processing units with chunk-based multiply-accumulate accelerators, optimizing for quantized formats and supporting multi-threaded execution to maximize bus utilization and computational density. Second, an interleaved DMA-compute scheduling mechanism, combined with a thread-aware acceleration strategy, overlaps data transfers with forward computations, substantially reducing latency in long-sequence decoding. Third, a dictionary-guided static token pruning approach eliminates low-importance tokens during the prefill stage by measuring similarity to anchor points, thereby avoiding redundant multiply-accumulate operations and minimizing cache usage.
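The interleaved DMA-compute scheduling described above is, at its heart, a double-buffering pattern: while the accelerator consumes weight chunk k, the DMA engine stages chunk k+1. The sketch below is a minimal, hedged illustration of that pattern only; it uses std::async as a stand-in for the non-blocking DMA engine and a naive byte-wise loop as a stand-in for the chunk-based accelerator. All identifiers (dma_start_read, cmma_run, matvec_interleaved) are hypothetical placeholders and are not the thesis's actual PYNQ-C extensions.

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <future>
#include <vector>

// Simulated non-blocking DMA read: stages one weight chunk into a bounce buffer.
// (Stand-in for the real DMA driver call, which would return immediately.)
std::future<void> dma_start_read(const uint8_t* src, uint8_t* dst, size_t bytes) {
    return std::async(std::launch::async,
                      [=] { std::memcpy(dst, src, bytes); });
}

// Stand-in for one accelerator invocation: computes rows_per_chunk rows of
// y = W_chunk * x. A real engine would operate on GGUF-quantized blocks.
void cmma_run(const uint8_t* w_chunk, const float* x, float* y,
              size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; ++r) {
        float acc = 0.f;
        for (size_t c = 0; c < cols; ++c)
            acc += static_cast<float>(w_chunk[r * cols + c]) * x[c];
        y[r] = acc;
    }
}

// Double-buffered mat-vec: the transfer of chunk k+1 overlaps the compute on
// chunk k, so the per-chunk cost tends toward max(transfer, compute) rather
// than their sum.
void matvec_interleaved(const uint8_t* weights, size_t rows_per_chunk,
                        size_t cols, size_t num_chunks,
                        const float* x, float* y) {
    const size_t chunk_bytes = rows_per_chunk * cols;
    std::vector<uint8_t> bounce[2] = {std::vector<uint8_t>(chunk_bytes),
                                      std::vector<uint8_t>(chunk_bytes)};

    // Prefetch chunk 0, then overlap every later transfer with compute.
    auto pending = dma_start_read(weights, bounce[0].data(), chunk_bytes);
    for (size_t k = 0; k < num_chunks; ++k) {
        pending.wait();  // chunk k is now resident in bounce[k % 2]
        if (k + 1 < num_chunks)
            pending = dma_start_read(weights + (k + 1) * chunk_bytes,
                                     bounce[(k + 1) % 2].data(), chunk_bytes);
        cmma_run(bounce[k % 2].data(), x, y + k * rows_per_chunk,
                 rows_per_chunk, cols);
    }
}

With two staging buffers, each chunk's effective latency approaches the larger of its transfer time and its compute time, which is where the claimed reduction in long-sequence decoding latency comes from.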

These modules are seamlessly integrated into the open-source Llama.cpp framework via extended PYNQ-C APIs, allowing users to load standard GGUF-format models for hardware acceleration without modifying network architectures or requantizing. The design enhances both prefill and decoding performance without additional memory overhead, enabling applications such as interactive dialogue, voice assistance, and on-device generation in edge scenarios. This platform fully harnesses the potential of FPGAs for decentralized LLM deployment and can be extended to multimodal models, dynamic pruning, or adaptive quantization to meet evolving academic and industrial needs.
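The transparent integration described here implies a dispatch-with-fallback pattern at the matrix-multiplication call site: the runtime inspects each quantized matmul, routes sufficiently large ones to the FPGA engine, and otherwise keeps the CPU path, so the GGUF model and graph code stay untouched. The sketch below illustrates that pattern only; dispatch_matmul_q, fpga_matmul_q, cpu_matmul_q, and the size threshold are assumed names and values, not actual llama.cpp, GGML, or PYNQ-C symbols.

#include <cstddef>
#include <cstdint>

// Placeholder FPGA path: returns false when the overlay is unavailable so the
// caller can fall back transparently. A real implementation would wrap the
// extended PYNQ-C driver calls here.
static bool fpga_matmul_q(const uint8_t* /*w_quant*/, const float* /*x*/,
                          float* /*y*/, size_t /*rows*/, size_t /*cols*/) {
    return false;  // no accelerator present in this standalone sketch
}

// Reference CPU path (treats the quantized weights as plain bytes purely for
// illustration; real code would dequantize per the GGUF block format).
static void cpu_matmul_q(const uint8_t* w_quant, const float* x, float* y,
                         size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; ++r) {
        float acc = 0.f;
        for (size_t c = 0; c < cols; ++c)
            acc += static_cast<float>(w_quant[r * cols + c]) * x[c];
        y[r] = acc;
    }
}

// Dispatcher: large layers go to the FPGA when available, everything else
// stays on the CPU. The 1M-element threshold is an illustrative assumption.
void dispatch_matmul_q(const uint8_t* w_quant, const float* x, float* y,
                       size_t rows, size_t cols) {
    const size_t kOffloadThreshold = 1u << 20;
    if (rows * cols >= kOffloadThreshold &&
        fpga_matmul_q(w_quant, x, y, rows, cols))
        return;                               // accelerator handled it
    cpu_matmul_q(w_quant, x, y, rows, cols);  // transparent fallback
}

Because the dispatch sits below the model graph, nothing about the GGUF file changes, matching the abstract's claim that no architecture modification or requantization is required.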
dc.description.provenance [en]: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-11-26T16:14:46Z. No. of bitstreams: 0
dc.description.provenance [en]: Made available in DSpace on 2025-11-26T16:14:46Z (GMT). No. of bitstreams: 0
dc.description.tableofcontents:
CONTENTS
誌謝 (Acknowledgements) i
摘要 (Chinese Abstract) iii
ABSTRACT v
CONTENTS vii
LIST OF FIGURES xi
LIST OF TABLES xiii
Chapter 1 Introduction 1
1.1 Background 1
1.1.1 Large Language Models (LLMs) 1
1.1.2 Efficient Training and Inference 3
1.1.3 Text-Generation Process and Decoding 5
1.1.4 Memory-Bandwidth Gap on Edge 6
1.2 Problem Statement 8
1.3 Motivation and Main Contributions 10
1.3.1 Latency Bottlenecks in Edge LLM Inference 10
1.3.2 Opportunities for Chunk-Based MatMul Accelerator 12
1.3.3 Software–Hardware Co-Design with FPGA 13
1.3.4 Static Token Pruning for Prefill Acceleration 14
1.4 Thesis Objectives 14
1.5 Thesis Organization 16
Chapter 2 Review of Edge LLM Systems 19
2.1 Edge Constraints and Challenges 19
2.1.1 Compute/Memory/Power Limitations 19
2.1.2 Runtime Bandwidth Analysis 20
2.2 Llama.cpp Software Framework 21
2.2.1 GGUF Format & Mixed-Precision Quantization 22
2.2.2 Python-C Bindings Overhead 23
2.3 PYNQ-C Framework for FPGA Integration 23
2.3.1 Overlay Loading & Memory-Mapping 24
2.3.2 Hardware API Abstraction 25
2.4 Related FPGA-Based Edge-LLM System 26
2.4.1 SECDA-LLM [17] 26
2.4.2 LlamaF [18] 28
2.4.3 Efficient LLM Inference System [19] 29
2.5 Challenges of the Prior Works 31
Chapter 3 Matrix-Multiplication Engine (MME) 33
3.1 System Overview 33
3.2 Chunk-Based MatMul Accelerator (CMMA) 35
3.2.1 DMA-Aware Dataflow Design 36
3.2.2 Optimizing Throughput 37
3.3 Vector-dot Processing Unit (VPU) 39
3.3.1 Scalability 41
3.4 Resource Utilization 42
3.5 Summary 43
Chapter 4 Software–Hardware Joint Optimizations 47
4.1 Multi-Channel DMA 47
4.2 Interleaved DMA-Compute Scheduling 48
4.2.1 Overlap Technique 50
4.3 Thread-Aware Hardware Acceleration 51
4.3.1 Workload Partitioning 51
4.3.2 Threading Synchronization Strategies 53
4.4 Summary 54
Chapter 5 Dictionary-Guided Static Token Pruning 59
5.1 Dataset 59
5.2 Perplexity 60
5.3 Token Elimination Algorithm 63
5.3.1 Token-level Pruning 63
5.3.2 Experiment Result 66
5.3.3 Performance analysis with pruning ratio 67
5.4 Summary 68
Chapter 6 Contributions and Future Works 72
6.1 Contributions 72
6.2 Future Works 73
6.2.1 Large-Scale and Sparsity-Aware CMMA 73
6.2.2 Dynamic Runtime Token Pruning 74
Reference 76
dc.language.iso: zh_TW
dc.subject: 現場可程式化邏輯閘陣列
dc.subject: 大型語言模型
dc.subject: 邊緣推論
dc.subject: 軟硬體協同
dc.subject: 令牌剪枝
dc.subject: FPGA
dc.subject: LLM
dc.subject: Edge Inference
dc.subject: Hardware-Software Co-Design
dc.subject: Token Pruning
dc.title: 基於可程式化邏輯陣列實現高效邊緣生成式大型語言模型推論引擎 [zh_TW]
dc.title: Efficient Edge Inference Engine for Generative Large Language Models on FPGA [en]
dc.type: Thesis
dc.date.schoolyear: 114-1
dc.description.degree: 碩士
dc.contributor.oralexamcommittee: 陳坤志;沈中安;阮聖彰 [zh_TW]
dc.contributor.oralexamcommittee: Kun-Chih Chen; Chung-An Shen; Shanq-Jang Ruan [en]
dc.subject.keyword: 現場可程式化邏輯閘陣列,大型語言模型,邊緣推論,軟硬體協同,令牌剪枝 [zh_TW]
dc.subject.keyword: FPGA, LLM, Edge Inference, Hardware-Software Co-Design, Token Pruning [en]
dc.relation.page: 78
dc.identifier.doi: 10.6342/NTU202504594
dc.rights.note: 同意授權(限校園內公開)
dc.date.accepted: 2025-10-22
dc.contributor.author-college: 電機資訊學院
dc.contributor.author-dept: 電子工程學研究所
dc.date.embargo-lift: 2025-11-27
Appears in Collections: 電子工程學研究所

Files in This Item:
File | Size | Format
ntu-114-1.pdf (access restricted to NTU campus IP addresses; use the library VPN service for off-campus access) | 4.31 MB | Adobe PDF