NTU Theses and Dissertations Repository
Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101150
Title: An Efficient PyTorch-Based Inference Framework for Mixture-of-Experts Large Language Models
Authors: Yanwen Gai
Advisor: Shih-Hao Hung
Keyword: LLM Inference, Parallel Computing, Mixture of Experts
Publication Year: 2025
Degree: Master's
Abstract: Mixture-of-Experts (MoE) models are attracting increasing attention in large language model (LLM) research, owing to their distinctive architecture, which delivers strong model scalability and computational efficiency. However, mainstream LLM inference systems with MoE support, such as vLLM and SGLang, while capable of very high performance, rely heavily on customized optimizations for specific hardware, which limits their flexibility and extensibility for private LLM deployments. To address this limitation, we introduce a lightweight PyTorch-based inference framework for MoE models that makes no strict hardware assumptions and delivers high efficiency on consumer-grade GPUs. We design a multidimensional parallel engine and employ PyTorch's CUDA Graphs as the primary acceleration mechanism to reduce kernel bubbles, and we further develop custom Triton kernels for additional speedup. In addition, a performance predictor automatically searches for the optimal parallelization strategy for the target compute platform. Experiments on an NVIDIA RTX 4090 GPU cluster show up to a 4x throughput improvement over other PyTorch-based solutions such as Hugging Face's Transformers.
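The abstract's "performance predictor" can be illustrated with a minimal sketch: enumerate candidate parallelization strategies and score each with an analytical cost model, keeping the cheapest. The function names, cost constants, and the (tensor-parallel, expert-parallel) decomposition below are illustrative assumptions for exposition, not the thesis's actual predictor.

```python
# Hypothetical sketch of a parallelization-strategy search: score every
# feasible (tensor-parallel, expert-parallel) degree pair with a toy
# analytical cost model and return the pair with the lowest predicted
# per-step latency. Cost constants are made up for illustration.
from itertools import product


def predict_step_time(tp, ep, num_gpus, compute_ms=8.0, comm_ms_per_link=0.5):
    """Toy cost model: compute shrinks with parallelism, communication grows."""
    if tp * ep != num_gpus:
        return None  # infeasible: the strategy must use exactly all GPUs
    compute = compute_ms / (tp * ep)           # ideal compute scaling
    comm = comm_ms_per_link * (tp - 1)         # all-reduce cost grows with TP
    comm += comm_ms_per_link * 0.5 * (ep - 1)  # all-to-all cost grows with EP
    return compute + comm


def search_best_strategy(num_gpus):
    """Return the (tp, ep) pair with the lowest predicted step time."""
    best = None
    for tp, ep in product(range(1, num_gpus + 1), repeat=2):
        t = predict_step_time(tp, ep, num_gpus)
        if t is not None and (best is None or t < best[0]):
            best = (t, (tp, ep))
    return best[1]


strategy = search_best_strategy(8)  # picks a balanced TP/EP split
```

Under this toy model, a real predictor would replace the analytic terms with measured or profiled costs for the target platform; the search structure (enumerate feasible strategies, score, take the minimum) stays the same.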
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101150
DOI: 10.6342/NTU202504827
Fulltext Rights: Authorized for release (open access worldwide)
Embargo Lift Date: 2026-01-01
Appears in Collections: Department of Computer Science and Information Engineering

Files in This Item:
ntu-114-1.pdf (1.45 MB, Adobe PDF)


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.

社群連結
聯絡資訊
10617臺北市大安區羅斯福路四段1號
No.1 Sec.4, Roosevelt Rd., Taipei, Taiwan, R.O.C. 106
Tel: (02)33662353
Email: ntuetds@ntu.edu.tw
意見箱
相關連結
館藏目錄
國內圖書館整合查詢 MetaCat
臺大學術典藏 NTU Scholars
臺大圖書館數位典藏館
本站聲明
© NTU Library All Rights Reserved