NTU Theses and Dissertations Repository
Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97774
Title: 透過共享 GPU 記憶體的發佈/訂閱中介軟體實現低延遲邊緣人工智慧通訊
Low-Latency Edge AI Communication with Pub/Sub Middleware via GPU Memory Sharing
Authors: 官澔恩
Hao-En Kuan
Advisor: 洪士灝
Shih-Hao Hung
Keyword: 發佈/訂閱中介軟體, 圖形處理器, 共享記憶體, 即時動態記憶體配置器,
Pub/Sub Middleware, GPU, Shared Memory, Real-Time Dynamic Memory Allocator
Publication Year: 2025
Degree: 碩士 (Master's)
Abstract: 即時邊緣人工智慧應用通常需要有效率的 GPU 資料處理與傳輸。由於這類應用通常高度模組化,因此廣泛使用發佈-訂閱模式以在各個元件之間傳遞資料。然而,現有的發佈-訂閱中介軟體在 GPU 與主記憶體之間會產生多餘的記憶體複製,導致顯著的延遲。為了解決這個問題,我們提出了認知 GPU 的發佈-訂閱通訊機制(GPU-Aware Pub/Sub communication,簡稱 GAPS),這是一種通用解決方案,將共享的 CUDA 記憶體與現有的發佈-訂閱中介軟體(如 Zenoh-pico 和 Iceoryx)整合在一起。GAPS 透過讓發佈者與訂閱者共享 GPU 記憶體,來消除不必要的記憶體複製,進而大幅降低資料傳輸延遲。在我們的設計中,我們提出了一個獨立的共享 CUDA 記憶體管理器,會在每個「主題」初始化時,為該「主題」建立一個共享的 CUDA 記憶體池。為了在此記憶體池實現細粒度的記憶體分配,我們修改了一種即時動態記憶體配置器 Two-Level Segregated Fit(TLSF),使其具備多執行緒安全性且能管理 GPU 記憶體。此外,我們還開發了 PyGAPS,一個用於加速發佈 PyTorch 張量的延伸版本,能消除在人工智慧應用中的序列化開銷。根據我們的實驗結果,GAPS 顯著降低端到端延遲,並提升簡化的電腦視覺流程的吞吐量(在影像分割任務中提升最多達 1.5 倍,在分類任務中提升最多達 3.8 倍),是一個適用於即時邊緣人工智慧的穩健解決方案。
Real-time Edge AI applications often require efficient GPU-based data processing and communication. Since such applications are typically highly modularized, the publish–subscribe (pub/sub) pattern is widely used to deliver data among components. However, existing pub/sub middleware introduces significant latency due to redundant memory copies between GPU and host memory. To address this, we propose GPU-Aware Pub/Sub communication (GAPS), a universal solution that integrates shared CUDA memory with existing pub/sub middleware, such as Zenoh-pico and Iceoryx. GAPS minimizes data transfer latency by enabling GPU memory sharing between publishers and subscribers, eliminating unnecessary memory copies. In our work, we propose an independent shared CUDA memory manager that creates a shared CUDA memory pool for each topic during the topic's initialization. For fine-grained allocation from the pool, we modify Two-Level Segregated Fit (TLSF), a real-time dynamic memory allocator, making it process-safe and capable of managing GPU memory. Additionally, we develop PyGAPS, an extension that accelerates the publication of PyTorch tensors, eliminating serialization overhead in AI-driven applications. Our evaluation demonstrates that GAPS significantly reduces end-to-end latency and improves the throughput of simplified computer vision pipelines (by up to 1.5× in the segmentation task and 3.8× in the classification task), making it a robust solution for real-time Edge AI.
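The allocator the abstract builds on, Two-Level Segregated Fit (TLSF), achieves O(1) allocation by mapping each block size to a pair of free-list indices: a first level by power-of-two range, and a second level that linearly subdivides each range. As background only (this is the generic TLSF mapping, not the thesis's modified, GPU-capable implementation; `SL_BITS` and the function name are illustrative), the core index computation can be sketched as:

```python
# Illustrative sketch of TLSF's two-level size-to-free-list mapping.
SL_BITS = 4  # log2 of second-level subdivisions; 4-5 is a common TLSF choice


def tlsf_mapping(size: int) -> tuple[int, int]:
    """Map a block size to (first-level, second-level) free-list indices.

    The first level classifies blocks into power-of-two ranges
    [2^fl, 2^(fl+1)); the second level splits each range into
    2**SL_BITS equal slots, so a fitting free list is found in O(1).
    """
    # Assume a minimum block size so that fl >= SL_BITS; real TLSF
    # handles smaller "tiny" blocks with a dedicated first-level range.
    assert size >= (1 << SL_BITS)
    fl = size.bit_length() - 1                      # floor(log2(size))
    sl = (size >> (fl - SL_BITS)) - (1 << SL_BITS)  # linear slot within the range
    return fl, sl
```

For example, a 1000-byte request falls in range [512, 1024), slot 15 of 16, i.e. `(9, 15)`, while exactly 1024 bytes maps to `(10, 0)`. The thesis's contribution layers process safety and GPU (CUDA) memory management on top of this scheme.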
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97774
DOI: 10.6342/NTU202501397
Fulltext Rights: 同意授權(限校園內公開) (authorized; full text available on campus only)
Embargo lift date: 2025-07-17
Appears in Collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in This Item:
ntu-113-2.pdf (2.05 MB, Adobe PDF; access limited to NTU IP range)


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
