Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/100956

| Title: | Efficient Edge Inference Engine for Generative Large Language Models on FPGA (基於可程式化邏輯陣列實現高效邊緣生成式大型語言模型推論引擎) |
| Author: | Kuan-Heng Liu (劉冠亨) |
| Advisor: | An-Yeu Wu (吳安宇) |
| Keywords: | FPGA, LLM, Edge Inference, Hardware-Software Co-Design, Token Pruning |
| Publication Year: | 2025 |
| Degree: | Master |
| Abstract: | Large Language Models (LLMs), leveraging deep attention mechanisms, have demonstrated unprecedented accuracy in text generation, knowledge reasoning, and cross-modal understanding. However, their massive parameter counts and high memory-bandwidth requirements pose significant barriers to real-time inference on edge devices such as smartphones, wearables, and drones. To overcome these bottlenecks, this thesis introduces Edge-LLM, a hardware-software co-designed inference system built on Field-Programmable Gate Arrays (FPGAs). By exploiting reconfigurable logic and low-bit quantized models, the system balances flexibility, energy efficiency, and real-time performance. The core of the system comprises three modules: (1) a Matrix-Multiplication Engine that couples vector-dot processing units with chunk-based multiply-accumulate accelerators, is optimized for quantized formats, and supports multi-threaded execution to maximize bus-bandwidth utilization and computational density; (2) an interleaved DMA-compute scheduling scheme that, combined with a thread-aware acceleration strategy, overlaps data transfers with forward computation and substantially reduces latency in long-sequence decoding; and (3) a dictionary-guided static token pruning method that removes low-importance tokens during the prefill stage based on their similarity to anchor tokens, avoiding redundant multiply-accumulate operations and reducing cache usage (an illustrative sketch of this pruning step follows the record below). These modules are integrated into the open-source Llama.cpp framework through extended PYNQ-C APIs, so users can load standard GGUF-format models and obtain hardware acceleration without modifying the network architecture or requantizing. The design improves both prefill and decoding performance without additional memory overhead, enabling edge scenarios such as interactive dialogue, voice assistance, and on-device generation. The platform demonstrates the potential of FPGAs for decentralized LLM deployment and can be extended to multimodal models, dynamic pruning, or adaptive quantization to meet evolving academic and industrial needs. |
| URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/100956 |
| DOI: | 10.6342/NTU202504594 |
| Full-Text Authorization: | Authorized (campus access only) |
| Electronic Full-Text Release Date: | 2025-11-27 |
| Appears in Collections: | Graduate Institute of Electronics Engineering |
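
The abstract describes the dictionary-guided static token pruning only at a high level. The sketch below is a minimal illustration of what the prefill-stage pruning criterion could look like, assuming token importance is scored by cosine similarity against a small set of anchor embeddings; the function names, anchor set, and threshold are hypothetical and do not come from the thesis itself.

```cpp
// Minimal sketch of anchor-based static token pruning for the prefill stage.
// NOTE: illustrative only; names, anchor set, and threshold are hypothetical.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine similarity between two embedding vectors of equal length.
static float cosine_similarity(const std::vector<float>& a,
                               const std::vector<float>& b) {
    float dot = 0.0f, na = 0.0f, nb = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb) + 1e-8f);
}

// Returns the indices of tokens to keep. A token survives if its best
// similarity to any anchor embedding reaches the threshold; all other
// tokens are dropped before the prefill forward pass, saving
// multiply-accumulate work and cache entries.
std::vector<std::size_t> prune_prefill_tokens(
        const std::vector<std::vector<float>>& token_embeddings,
        const std::vector<std::vector<float>>& anchor_embeddings,
        float threshold) {
    std::vector<std::size_t> kept;
    for (std::size_t t = 0; t < token_embeddings.size(); ++t) {
        float best = -1.0f;
        for (const auto& anchor : anchor_embeddings) {
            best = std::max(best, cosine_similarity(token_embeddings[t], anchor));
        }
        if (best >= threshold) {
            kept.push_back(t);  // judged important: keep for prefill
        }
    }
    return kept;
}
```

In a Llama.cpp-style pipeline, the retained indices would then define the pruned prefill batch, so that subsequent matrix multiplications and cache writes cover only the surviving tokens.
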
Files in this item:

| File | Size | Format |
|---|---|---|
| ntu-114-1.pdf (access restricted to NTU campus IP addresses; use the VPN off-campus connection service from outside campus) | 4.31 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
