Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101150

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 洪士灝 | zh_TW |
| dc.contributor.advisor | Shih-Hao Hung | en |
| dc.contributor.author | 蓋彥文 | zh_TW |
| dc.contributor.author | Yanwen Gai | en |
| dc.date.accessioned | 2025-12-31T16:07:35Z | - |
| dc.date.available | 2026-01-01 | - |
| dc.date.copyright | 2025-12-31 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-12-22 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101150 | - |
| dc.description.abstract | 混合專家(Mixture-of-Experts, MoE)模型因其獨特的架構帶來了強大的模型可擴展性與計算效率,正日益受到大型語言模型(LLM)研究領域的關注。然而,諸如 vLLM 和 SGLang 等支持 MoE 的主流 LLM 推理系統,儘管能達到極高性能,但嚴重依賴針對特定硬體的客製化優化,這限制了它們在私有化部署時的靈活性與可擴展性。為了解決此局限性,我們針對 MoE 模型提出一個基於 PyTorch 的輕量級推理框架。該框架無需嚴格的硬體假設,並能在消費級 GPU 上實現高效率推理。我們設計了一個支持多維並行的引擎,並採用 PyTorch 的 CUDA Graphs 作為主要加速機制以減少內核氣泡(kernel bubbles),此外還開發了客製化 Triton 內核以獲得額外加速。我們還設計了一個性能預測器,能自動為目標計算平台搜尋最佳的並行化策略。在 NVIDIA RTX 4090 GPU 叢集上的實驗表明,與主流的基於 PyTorch 的解決方案(如 Hugging Face 的 Transformers)相比,本框架實現了最高達四倍的吞吐量提升。 | zh_TW |
| dc.description.abstract | Mixture-of-Experts (MoE) models are attracting increasing attention in large language model (LLM) research, owing to their distinctive architecture, which delivers strong model scalability and computational efficiency. However, mainstream LLM inference systems that support MoE, such as vLLM and SGLang, while capable of very high performance, rely heavily on hardware-specific customized optimizations, which limits their flexibility and extensibility for private LLM deployments. To address this limitation, we introduce a lightweight PyTorch-based inference framework for MoE models that makes no strict hardware assumptions and delivers high efficiency on consumer-grade GPUs. We design an inference engine that supports multidimensional parallelism, employ PyTorch's CUDA Graphs as the primary acceleration mechanism to reduce kernel bubbles, and further develop custom Triton kernels for additional speedup. In addition, a performance predictor automatically searches for the optimal parallelization strategy on the target compute platform. Experiments on an NVIDIA RTX 4090 GPU cluster show up to a 4x throughput improvement over mainstream PyTorch-based solutions such as Hugging Face's Transformers. (Illustrative sketches of the CUDA Graph and Triton kernel-fusion patterns follow the metadata table below.) | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-12-31T16:07:35Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-12-31T16:07:35Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Verification Letter from the Oral Examination Committee i; Acknowledgements ii; 摘要 iii; Abstract iv; Contents vi; List of Figures viii; List of Tables ix; Chapter 1 Introduction 1; Chapter 2 Background and Related Work 4; 2.1 Mixture of Experts 4; 2.2 Parallelization Strategies for MoE 6; 2.3 Inference System for MoE 9; 2.4 LLM Performance Modeling 10; Chapter 3 Motivation 13; 3.1 Exploring Multidimensional Parallelism 15; 3.2 Mitigating Kernel Bubbles 17; 3.3 Kernel Fusion 18; Chapter 4 Methodology 21; 4.1 Transformer Blocks Optimized with CUDA Graphs 22; 4.2 Kernel Fusion with Triton 23; 4.3 Performance Predictor 25; 4.4 Inference Engine for Multidimensional Parallelism 27; Chapter 5 Evaluation 29; 5.1 Experimental Setup 29; 5.2 Multidimensional Parallelism with CUDA Graphs 30; 5.3 Custom Triton Kernels 33; 5.4 Fidelity of the Performance Predictor 35; Chapter 6 Conclusion 36; References 37 | - |
| dc.language.iso | en | - |
| dc.subject | 大型語言模型推理 | - |
| dc.subject | 平行計算 | - |
| dc.subject | 混合專家模型 | - |
| dc.subject | LLM Inference | - |
| dc.subject | Parallel Computing | - |
| dc.subject | Mixture of Experts | - |
| dc.title | 基於 PyTorch 的混合專家大型語言模型高效推理框架 | zh_TW |
| dc.title | An Efficient PyTorch-Based Inference Framework for Mixture-of-Experts Large Language Models | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 114-1 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.oralexamcommittee | 施吉昇;涂嘉恒;張原豪 | zh_TW |
| dc.contributor.oralexamcommittee | Chi-Sheng Shih;Chia-Heng Tu;Yuan-Hao Chang | en |
| dc.subject.keyword | 大型語言模型推理,平行計算,混合專家模型 | zh_TW |
| dc.subject.keyword | LLM Inference,Parallel Computing,Mixture of Experts | en |
| dc.relation.page | 41 | - |
| dc.identifier.doi | 10.6342/NTU202504827 | - |
| dc.rights.note | 同意授權(全球公開) | - |
| dc.date.accepted | 2025-12-22 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 資訊工程學系 | - |
| dc.date.embargo-lift | 2026-01-01 | - |
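
The abstract above names the framework's two main single-GPU mechanisms, CUDA Graph capture of transformer blocks and Triton kernel fusion, without showing code. The two sketches below illustrate only the general, publicly documented patterns behind those mechanisms; they are not taken from the thesis, and the module, shape, and kernel names (`block`, `batch_size`, `hidden_size`, `silu_mul`) are placeholders chosen for illustration.

A minimal sketch of PyTorch's standard CUDA Graph capture-and-replay pattern for a decode-step block with static shapes:

```python
import torch

# Stand-in for a transformer decoder block; an MoE serving framework would
# capture its own blocks. Shapes are fixed because CUDA Graphs require static
# tensor addresses and sizes (one graph per batch-size bucket in practice).
batch_size, hidden_size = 8, 4096
block = torch.nn.Sequential(
    torch.nn.Linear(hidden_size, hidden_size),
    torch.nn.GELU(),
    torch.nn.Linear(hidden_size, hidden_size),
).cuda().half().eval()

static_x = torch.zeros(batch_size, hidden_size, device="cuda", dtype=torch.float16)

# Warm up on a side stream before capture, as PyTorch recommends.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        block(static_x)
torch.cuda.current_stream().wait_stream(s)

# Capture once; each replay relaunches all captured kernels from a single CPU call.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_y = block(static_x)

def run_block(x: torch.Tensor) -> torch.Tensor:
    static_x.copy_(x)  # copy new activations into the captured input buffer
    graph.replay()
    return static_y
```

Replaying a captured graph is what removes the launch gaps ("kernel bubbles") between the many small kernels of a decode step; the cost is that input shapes must stay fixed, which is why serving systems typically capture one graph per batch-size bucket.

A minimal Triton sketch of one common fusion target in SwiGLU-style feed-forward layers: a fused SiLU-and-multiply kernel that makes a single pass over memory instead of launching separate element-wise kernels (again illustrative, not the thesis kernels):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def _silu_mul_kernel(gate_ptr, up_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    gate = tl.load(gate_ptr + offsets, mask=mask).to(tl.float32)
    up = tl.load(up_ptr + offsets, mask=mask).to(tl.float32)
    # SiLU(gate) * up computed in one pass, no intermediate tensor materialized.
    result = gate * tl.sigmoid(gate) * up
    tl.store(out_ptr + offsets, result.to(out_ptr.dtype.element_ty), mask=mask)

def silu_mul(gate: torch.Tensor, up: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(gate)
    n = gate.numel()
    grid = (triton.cdiv(n, 1024),)
    _silu_mul_kernel[grid](gate, up, out, n, BLOCK_SIZE=1024)
    return out
```
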
Appears in Collections: 資訊工程學系
Files in this item:
| File | Size | Format | |
|---|---|---|---|
| ntu-114-1.pdf | 1.45 MB | Adobe PDF | View/Open |
