Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101615

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 吳安宇 | zh_TW |
| dc.contributor.advisor | An-Yeu Wu | en |
| dc.contributor.author | 黃彥鈞 | zh_TW |
| dc.contributor.author | Yen-Chun Huang | en |
| dc.date.accessioned | 2026-02-11T16:47:55Z | - |
| dc.date.available | 2026-02-12 | - |
| dc.date.copyright | 2026-02-11 | - |
| dc.date.issued | 2026 | - |
| dc.date.submitted | 2026-02-02 | - |
| dc.identifier.citation | [1] H. Zhang, A. Ning, R. B. Prabhakar, and D. Wentzlaff, "LLMCompass: Enabling efficient hardware design for large language model inference," in 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 2024, pp. 1080–1096.
[2] OpenAI. "ChatGPT." https://openai.com/blog/chatgpt (accessed Jul. 31, 2025).
[3] GitHub. "GitHub Copilot." https://github.com/features/copilot (accessed Jul. 31, 2025).
[4] G. Team et al., "Gemini: A family of highly capable multimodal models," arXiv preprint arXiv:2312.11805, 2023.
[5] T. Brown et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
[6] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language models are unsupervised multitask learners," OpenAI Blog, vol. 1, no. 8, p. 9, 2019.
[7] H. Touvron et al., "LLaMA: Open and efficient foundation language models," arXiv preprint arXiv:2302.13971, 2023.
[8] M. Phuong and M. Hutter, "Formal algorithms for transformers," arXiv preprint arXiv:2207.09238, 2022.
[9] A. Vaswani et al., "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[10] S. Bubeck et al., "Sparks of artificial general intelligence: Early experiments with GPT-4," arXiv preprint arXiv:2303.12712, 2023.
[11] Anthropic. "Introducing Claude." https://www.anthropic.com/news/introducing-claude (accessed Jul. 31, 2025).
[12] P. Patel et al., "Splitwise: Efficient generative LLM inference using phase splitting," in 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 2024, pp. 118–132.
[13] A. Gholami, Z. Yao, S. Kim, C. Hooper, M. W. Mahoney, and K. Keutzer, "AI and memory wall," IEEE Micro, 2024.
[14] K. Moraes. "Classic Moore's Law Scaling Challenges Demand New Ways to Wire and Integrate Chips." https://www.appliedmaterials.com/us/en/blog/blog-posts/classic-moores-law-scaling-challenges-demand-new-ways-wire-and-integrate-chips.html (accessed).
[15] M. G. Inc. "Ansys模擬引領CoWoS封裝與光矽子技術新突破 邁向高速運算新紀元." https://www.macnica.com/apac/galaxy/zh_tw/products-support/technical-articles/ansys-lead-breakthroughs-in-CoWoS-silicon-photonics/ (accessed Jul. 31, 2025).
[16] S. Naffziger et al., "Pioneering chiplet technology and design for the AMD EPYC™ and Ryzen™ processor families: Industrial product," in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 2021, pp. 57–70.
[17] Y. S. Shao et al., "Simba: Scaling deep-learning inference with multi-chip-module-based architecture," in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019, pp. 14–27.
[18] Z. Tan, H. Cai, R. Dong, and K. Ma, "NN-Baton: DNN workload orchestration and chiplet granularity exploration for multichip accelerators," in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 2021, pp. 1013–1026.
[19] J. Zhang et al., "INDM: Chiplet-based interconnect network and dataflow mapping for DNN accelerators," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2023.
[20] I. Ong, "Efficient distributed LLM inference with dynamic partitioning," University of California, Berkeley, Technical Report UCB/EECS-2024-108, May 2024.
[21] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, "Megatron-LM: Training multi-billion parameter language models using model parallelism," arXiv preprint arXiv:1909.08053, 2019.
[22] D. Gu et al., "LoongTrain: Efficient training of long-sequence LLMs with head-context parallelism," arXiv preprint arXiv:2406.18485, 2024.
[23] S. A. Jacobs et al., "DeepSpeed Ulysses: System optimizations for enabling training of extreme long sequence transformer models," arXiv preprint arXiv:2309.14509, 2023.
[24] A. Juneja. "What is Inference Parallelism and How it Works." https://www.infracloud.io/blogs/inference-parallelism/ (accessed Jul. 31, 2025).
[25] C. Woolley, "NCCL: Accelerated multi-GPU collective communications," NCCL-Woolley.pdf, 2015.
[26] A. Samajdar, J. M. Joseph, Y. Zhu, P. Whatmough, M. Mattina, and T. Krishna, "A systematic methodology for characterizing scalability of DNN accelerators using SCALE-Sim," in 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2020, pp. 58–68.
[27] A. Samajdar, Y. Zhu, P. Whatmough, M. Mattina, and T. Krishna, "SCALE-Sim: Systolic CNN accelerator simulator," arXiv preprint arXiv:1811.02883, 2018.
[28] P. Patarasuk and X. Yuan, "Bandwidth optimal all-reduce algorithms for clusters of workstations," Journal of Parallel and Distributed Computing, vol. 69, no. 2, pp. 117–124, 2009.
[29] R. Thakur, R. Rabenseifner, and W. Gropp, "Optimization of collective communication operations in MPICH," The International Journal of High Performance Computing Applications, vol. 19, no. 1, pp. 49–66, 2005. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101615 | - |
| dc.description.abstract | 隨著大型語言模型(LLMs)在模型參數與輸入序列長度上的持續擴張,傳統單晶片架構已難以滿足其推論所需的計算與記憶體資源。小晶片(Chiplet)架構提供了具備可擴展性與高頻寬互連的硬體解決方案,然而目前針對此架構下之 LLM 推論仍缺乏系統性的研究與模擬工具。
本論文首先提出一套基於環狀拓撲的小晶片硬體架構,並改造現有模擬器 LLMCompass [1],使其支援小晶片硬體架構與細緻的模型分割。接著,本研究提出兩種新穎的分割策略:SP-Ring與TP-Ring。SP-Ring透過環狀交換機制將模型權重傳輸與計算重疊執行,有效降低記憶體使用、通訊延遲與冗餘計算;TP-Ring則將計算與通訊操作切分為多段並交錯執行,進一步隱藏通訊延遲,並可調整重疊因子以因應不同序列長度。 模擬結果顯示,SP-Ring在晶片效能較低時較有優勢,可減少最多20.7% 的MLP與通訊延遲;TP-Ring則在系統由較高效能晶片構成時具備更好的表現,最多可減少22.8% 的非注意力區塊延遲。由於兩者共享相同的模型權重配置,推論系統可於執行時根據硬體配置與序列長度動態切換策略,以達到最佳效能。 本研究為小晶片架構上的分散式大型語言模型推論提供了研究方法與策略上的設計與分析,為未來在異質整合硬體架構、稀疏模型與記憶體導向優化等方向奠定基礎。 | zh_TW |
| dc.description.abstract | As large language models (LLMs) continue to scale in both parameter size and input sequence length, traditional monolithic chip architectures are increasingly unable to meet the growing computational and memory demands of inference workloads. Chiplet-based architectures offer scalable and high-bandwidth hardware solutions, yet systematic studies and simulation tools for LLM inference on such platforms remain scarce.
This thesis first proposes a ring-topology chiplet-based hardware architecture and extends the existing LLMCompass [1] simulator to support the proposed architecture and fine-grained model partitioning. Two novel partition strategies are introduced: SP-Ring and TP-Ring. SP-Ring overlaps model weight transfer with computation via ring-based exchange, effectively reducing memory usage, communication latency, and redundant computation. TP-Ring segments computation and communication into multiple interleaved stages, further hiding communication latency, and offers a tunable overlap factor to adapt to varying sequence lengths. Simulation results show that SP-Ring is more advantageous in systems built from less powerful chiplets, reducing combined MLP and communication latency by up to 20.7%, while TP-Ring performs better in systems composed of more powerful chiplets, reducing latency outside the attention blocks by up to 22.8%. Since both strategies share a unified model weight layout, the inference system can dynamically switch between them at runtime based on hardware configuration and sequence length to optimize overall latency (an illustrative sketch of this runtime switching follows the metadata table below). This work contributes a research methodology and partition-strategy designs for distributed LLM inference on chiplet-based platforms, paving the way for future exploration of heterogeneous integration, sparsity-aware computation, and memory-centric optimization. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2026-02-11T16:47:55Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2026-02-11T16:47:55Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | 誌謝 ii
摘要 iv
ABSTRACT v
CONTENTS vii
LIST OF FIGURES x
LIST OF TABLES xii
Chapter 1 Introduction 1
1.1 Background 1
1.1.1 Large Language Models (LLMs): Applications, Architecture, and Scaling 1
1.1.2 Advanced Packaging and Chiplet Architecture 5
1.2 Motivation and Objective 8
1.2.1 Lack of Study for Distributed LLM Inference on Chiplet Architecture 8
1.2.2 Inevitable Communication Overhead in Fine-grained Partition Strategies 9
1.3 Thesis Target 11
1.4 Thesis Organization 12
Chapter 2 Review of Prior Works 14
2.1 Chiplet-based Neural Network Inference Engine and Mapper 14
2.2 LLM Partition Strategies 16
2.3 Challenges of the Prior Works 19
Chapter 3 Hardware Architecture and Simulator 22
3.1 Proposed Ring-topology-based Hardware Architecture 22
3.2 The Simulation Framework based on LLMCompass 25
3.3 Summary 27
Chapter 4 Proposed SP-Ring Partition Strategy 28
4.1 Challenges for Megatron and Weight-Gathered Partition Strategies 28
4.2 Data Flow and Communication Volume 29
4.3 Memory Footprint 31
4.4 Latency Estimation 32
4.5 Simulation Results and Analysis 34
4.6 Summary 39
Chapter 5 Proposed TP-Ring Partition Strategy 41
5.1 Challenges for the Megatron Partition Strategy 41
5.2 TP-Ring Attention and TP-Ring MLP Data Flow 42
5.3 Overlapping Timing and Latency Estimation 44
5.4 Simulation Results and Analysis 47
5.4.1 Case 1: Communication Fully Hidden 51
5.4.2 Case 2: 1/OF Communication Visible 52
5.4.3 Case 3: 1.5/OF Communication Visible 53
5.5 Summary 55
Chapter 6 Dynamic Switching among SP-Ring, TP-Ring, and Other Strategies 57
Chapter 7 Conclusions and Future Directions 61
7.1 Main Contributions 61
7.2 Future Directions 62
Reference 64 | - |
| dc.language.iso | en | - |
| dc.subject | 大型語言模型 | - |
| dc.subject | 小晶片架構 | - |
| dc.subject | 分散式推論 | - |
| dc.subject | 模型平行 | - |
| dc.subject | Large Language Models (LLMs) | - |
| dc.subject | Chiplet Architecture | - |
| dc.subject | Distributed Inference | - |
| dc.subject | Model Parallelism | - |
| dc.title | 基於小晶片架構之分散式大型語言模型推論 | zh_TW |
| dc.title | Distributed Inference of Large Language Models Based on Chiplet Architecture | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 114-1 | - |
| dc.description.degree | Master's | - |
| dc.contributor.oralexamcommittee | 陳坤志;沈中安;阮聖彰 | zh_TW |
| dc.contributor.oralexamcommittee | Kun-Chih Chen;Chung-An Shen;Shanq-Jang Ruan | en |
| dc.subject.keyword | 大型語言模型,小晶片架構,分散式推論,模型平行 | zh_TW |
| dc.subject.keyword | Large Language Models (LLMs),Chiplet Architecture,Distributed Inference,Model Parallelism | en |
| dc.relation.page | 65 | - |
| dc.identifier.doi | 10.6342/NTU202600476 | - |
| dc.rights.note | Authorized (worldwide open access) | - |
| dc.date.accepted | 2026-02-04 | - |
| dc.contributor.author-college | 重點科技研究學院 | - |
| dc.contributor.author-dept | 積體電路設計與自動化學位學程 | - |
| dc.date.embargo-lift | 2027-08-31 | - |
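
The English abstract above describes switching between SP-Ring and TP-Ring at runtime based on hardware configuration and sequence length. The Python sketch below illustrates that selection idea only under stated assumptions: the `RingSystem` fields, the closed-form cost expressions, and the default overlap factor are hypothetical stand-ins invented for this example, not the thesis's actual latency model or the LLMCompass API.

```python
# Hypothetical sketch of runtime switching between SP-Ring and TP-Ring.
# The cost formulas below are illustrative assumptions that only encode
# the qualitative behavior the abstract describes.
from dataclasses import dataclass


@dataclass
class RingSystem:
    num_chiplets: int      # chiplets connected in a ring topology
    flops_per_s: float     # effective compute throughput per chiplet
    link_bandwidth: float  # per-link ring bandwidth, bytes/s


def sp_ring_latency(cfg: RingSystem, weight_bytes: float, layer_flops: float) -> float:
    """SP-Ring idea: weight shards circulate around the ring while each
    chiplet computes, so per ring step only the slower of computation
    and weight transfer is visible."""
    n = cfg.num_chiplets
    step_compute = layer_flops / (n * n) / cfg.flops_per_s   # 1/n of the work, over n steps
    step_transfer = (weight_bytes / n) / cfg.link_bandwidth  # forward one weight shard
    return n * max(step_compute, step_transfer)


def tp_ring_latency(cfg: RingSystem, activation_bytes: float,
                    layer_flops: float, overlap_factor: int) -> float:
    """TP-Ring idea: interleave computation and activation communication
    in `overlap_factor` segments; in the favorable case only about
    1/overlap_factor of the communication time remains visible."""
    compute = layer_flops / cfg.num_chiplets / cfg.flops_per_s
    comm = activation_bytes / cfg.link_bandwidth
    return compute + comm / overlap_factor


def choose_strategy(cfg: RingSystem, weight_bytes: float, activation_bytes: float,
                    layer_flops: float, overlap_factor: int = 4):
    """Both strategies share the same weight layout, so the cheaper one
    (under this toy model) can be picked per request at no extra cost."""
    sp = sp_ring_latency(cfg, weight_bytes, layer_flops)
    tp = tp_ring_latency(cfg, activation_bytes, layer_flops, overlap_factor)
    return ("SP-Ring", sp) if sp <= tp else ("TP-Ring", tp)


if __name__ == "__main__":
    # Toy numbers: 8 chiplets, 100 TFLOP/s each, 100 GB/s ring links.
    cfg = RingSystem(num_chiplets=8, flops_per_s=100e12, link_bandwidth=100e9)
    print(choose_strategy(cfg, weight_bytes=2e9, activation_bytes=64e6, layer_flops=4e12))
```

With these toy numbers the chiplets are fast, weight traffic bounds SP-Ring, and the chooser returns TP-Ring; lowering `flops_per_s` (e.g., to 5e12) flips the decision toward SP-Ring, mirroring the abstract's observation that SP-Ring favors systems built from less powerful chiplets.
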
Appears in Collections: 積體電路設計與自動化學位學程
Files in this item:
| File | Size | Format |
|---|---|---|
| ntu-114-1.pdf (publicly available online after 2027-08-31) | 5.71 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
