Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101615

| Title: | Distributed Inference of Large Language Models Based on Chiplet Architecture |
| Author: | Yen-Chun Huang |
| Advisor: | An-Yeu Wu |
| Keywords: | Large Language Models (LLMs), Chiplet Architecture, Distributed Inference, Model Parallelism |
| Publication Year: | 2026 |
| Degree: | Master's |
| Abstract: | As large language models (LLMs) continue to scale in both parameter size and input sequence length, traditional monolithic chip architectures are increasingly unable to meet the growing computational and memory demands of inference workloads. Chiplet-based architectures offer scalable and high-bandwidth hardware solutions, yet systematic studies and simulation tools for LLM inference on such platforms remain scarce. This thesis first proposes a ring-topology chiplet-based hardware architecture and extends the existing LLMCompass [1] simulator to support the proposed architecture and fine-grained model partitioning. Two novel partitioning strategies are introduced: SP-Ring and TP-Ring. SP-Ring overlaps model weight transfer with computation via ring-based exchange, effectively reducing memory usage, communication latency, and redundant computation. TP-Ring segments computation and communication into multiple interleaved stages, further hiding communication latency, and offers a tunable overlap factor to adapt to varying sequence lengths. Simulation results show that SP-Ring is more advantageous in systems with less powerful chiplets, reducing combined MLP and communication latency by up to 20.7%, while TP-Ring performs better in systems composed of more powerful chiplets, reducing non-attention-block latency by up to 22.8%. Since both strategies share a unified model weight layout, the inference system can dynamically switch between them at runtime based on hardware configuration and sequence length to optimize overall latency. This work provides a research methodology and partitioning strategies for distributed LLM inference on chiplet-based platforms, paving the way for future exploration of heterogeneous integration, sparsity-aware computation, and memory-centric optimization. |
| URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101615 |
| DOI: | 10.6342/NTU202600476 |
| Full-Text Authorization: | Authorized (open access worldwide) |
| Electronic Full-Text Release Date: | 2027-08-31 |
| Appears in Collections: | Graduate Program of Integrated Circuit Design and Automation |
Files in this item:
| File | Size | Format |
|---|---|---|
| ntu-114-1.pdf (available online after 2027-08-31) | 5.71 MB | Adobe PDF |
Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
