Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101615
Title: 基於小晶片架構之分散式大型語言模型推論
Distributed Inference of Large Language Models Based on Chiplet Architecture
Authors: 黃彥鈞
Yen-Chun Huang
Advisor: 吳安宇
An-Yeu Wu
Keyword: Large Language Models (LLMs), Chiplet Architecture, Distributed Inference, Model Parallelism
Publication Year: 2026
Degree: Master's
Abstract: As large language models (LLMs) continue to scale in both parameter count and input sequence length, traditional monolithic chip architectures are increasingly unable to meet the computational and memory demands of inference. Chiplet-based architectures offer a scalable hardware solution with high-bandwidth interconnects, yet systematic studies and simulation tools for LLM inference on such platforms remain scarce.
This thesis first proposes a ring-topology chiplet hardware architecture and extends the existing LLMCompass [1] simulator to support this architecture and fine-grained model partitioning. It then introduces two novel partitioning strategies, SP-Ring and TP-Ring. SP-Ring uses a ring-based exchange mechanism to overlap model weight transfer with computation, reducing memory usage, communication latency, and redundant computation. TP-Ring splits computation and communication into multiple segments and interleaves them, further hiding communication latency, and offers a tunable overlap factor to adapt to different sequence lengths.
Simulation results show that SP-Ring is more advantageous in systems with less powerful chiplets, reducing combined MLP and communication latency by up to 20.7%. TP-Ring performs better in systems composed of more powerful chiplets, reducing latency outside the attention blocks by up to 22.8%. Because both strategies share the same model weight layout, the inference system can switch between them at runtime based on hardware configuration and sequence length to minimize overall latency.
This work contributes a research methodology and partitioning strategies, with their design and analysis, for distributed LLM inference on chiplet-based platforms, laying a foundation for future work on heterogeneous integration, sparsity-aware computation, and memory-centric optimization.
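The ring-based exchange described in the abstract can be illustrated with a small, generic sketch. It is not the thesis's SP-Ring implementation: the function names, the one-shard-per-chiplet layout, and the latency formula are all assumptions made for illustration. Each chiplet starts with one weight shard and forwards it to its ring neighbor after every step, so every chiplet sees every shard after as many steps as there are chiplets. The toy latency model assumes a transfer can be fully hidden behind the computation of the same step.

```python
def ring_schedule(num_chiplets):
    """Return, for each step, the shard index held by each chiplet.

    Hypothetical layout: chiplet i starts with shard i and forwards its
    shard to chiplet (i + 1) % num_chiplets after every step.
    """
    schedule = []
    held = list(range(num_chiplets))  # held[i] = shard currently on chiplet i
    for _ in range(num_chiplets):
        schedule.append(list(held))
        # Rotate the ring: chiplet i receives what chiplet i - 1 held.
        held = [held[(i - 1) % num_chiplets] for i in range(num_chiplets)]
    return schedule


def ring_latency(num_chiplets, compute_time, transfer_time, overlap=True):
    """Toy latency model (an assumption, not taken from the thesis).

    There are num_chiplets compute steps and num_chiplets - 1 transfers.
    With overlap, a step costs the longer of compute and transfer; the
    final step has no transfer left to hide.
    """
    if overlap:
        return (num_chiplets - 1) * max(compute_time, transfer_time) + compute_time
    return num_chiplets * compute_time + (num_chiplets - 1) * transfer_time


if __name__ == "__main__":
    for step, held in enumerate(ring_schedule(4)):
        print(f"step {step}: shards on chiplets 0..3 = {held}")
    print("overlapped:", ring_latency(4, 2.0, 1.0, overlap=True))
    print("sequential:", ring_latency(4, 2.0, 1.0, overlap=False))
```

Under these assumptions, with 4 chiplets, a compute time of 2.0, and a transfer time of 1.0 (arbitrary units), the model gives 8.0 with overlap and 11.0 without. When the transfer is slower than the compute, the transfer time dominates each overlapped step instead, which is one reason the better strategy can depend on chiplet performance.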
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101615
DOI: 10.6342/NTU202600476
Fulltext Rights: Authorization granted (open to the public worldwide)
Embargo Lift Date: 2027-08-31
Appears in Collections: 積體電路設計與自動化學位學程

Files in This Item:
File: ntu-114-1.pdf (Until 2027-08-31)
Size: 5.71 MB
Format: Adobe PDF
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
