NTU Theses and Dissertations Repository
Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101150
Title: An Efficient PyTorch-Based Inference Framework for Mixture-of-Experts Large Language Models
Authors: Yanwen Gai
Advisor: Shih-Hao Hung
Keyword: LLM Inference, Parallel Computing, Mixture of Experts
Publication Year: 2025
Degree: Master's
Abstract: Mixture-of-Experts (MoE) models are attracting increasing attention in large language model (LLM) research, owing to their distinctive architecture, which delivers strong model scalability and computational efficiency. However, mainstream LLM inference systems with MoE support, such as vLLM and SGLang, while capable of very high performance, rely heavily on customized optimizations for specific hardware, which limits their flexibility and extensibility for private LLM deployments. To address this limitation, we introduce a lightweight PyTorch-based inference framework for MoE models that makes no strict hardware assumptions and delivers high efficiency on consumer-grade GPUs. We design a multidimensional parallel engine and employ PyTorch's CUDA Graphs as the primary acceleration mechanism to reduce kernel bubbles, and we further develop custom Triton kernels for additional speedup. In addition, a performance predictor automatically searches for the optimal parallelization strategy for the target compute platform. Experiments on an NVIDIA RTX 4090 GPU cluster show up to a 4x throughput improvement over other PyTorch-based solutions such as Hugging Face's Transformers.
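The abstract's "performance predictor" can be illustrated with a minimal sketch: enumerate candidate parallelization strategies and score each with an analytical cost model, keeping the cheapest. The function names, cost constants, and the (tensor-parallel, expert-parallel) decomposition below are illustrative assumptions for exposition, not the thesis's actual predictor.

```python
# Hypothetical sketch of a parallelization-strategy search: score every
# feasible (tensor-parallel, expert-parallel) degree pair with a toy
# analytical cost model and return the pair with the lowest predicted
# per-step latency. Cost constants are made up for illustration.
from itertools import product


def predict_step_time(tp, ep, num_gpus, compute_ms=8.0, comm_ms_per_link=0.5):
    """Toy cost model: compute shrinks with parallelism, communication grows."""
    if tp * ep != num_gpus:
        return None  # infeasible: the strategy must use exactly all GPUs
    compute = compute_ms / (tp * ep)           # ideal compute scaling
    comm = comm_ms_per_link * (tp - 1)         # all-reduce cost grows with TP
    comm += comm_ms_per_link * 0.5 * (ep - 1)  # all-to-all cost grows with EP
    return compute + comm


def search_best_strategy(num_gpus):
    """Return the (tp, ep) pair with the lowest predicted step time."""
    best = None
    for tp, ep in product(range(1, num_gpus + 1), repeat=2):
        t = predict_step_time(tp, ep, num_gpus)
        if t is not None and (best is None or t < best[0]):
            best = (t, (tp, ep))
    return best[1]


strategy = search_best_strategy(8)  # picks a balanced TP/EP split
```

Under this toy model, a real predictor would replace the analytic terms with measured or profiled costs for the target platform; the search structure (enumerate feasible strategies, score, take the minimum) stays the same.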
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101150
DOI: 10.6342/NTU202504827
Fulltext Rights: Authorized for release (open access worldwide)
Embargo Lift Date: 2026-01-01
Appears in Collections: Department of Computer Science and Information Engineering

Files in This Item:
ntu-114-1.pdf (1.45 MB, Adobe PDF)


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.

社群連結
聯絡資訊
10617臺北市大安區羅斯福路四段1號
No.1 Sec.4, Roosevelt Rd., Taipei, Taiwan, R.O.C. 106
Tel: (02)33662353
Email: ntuetds@ntu.edu.tw
意見箱
相關連結
館藏目錄
國內圖書館整合查詢 MetaCat
臺大學術典藏 NTU Scholars
臺大圖書館數位典藏館
本站聲明
© NTU Library All Rights Reserved