NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97774
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | 洪士灝 | zh_TW
dc.contributor.advisor | Shih-Hao Hung | en
dc.contributor.author | 官澔恩 | zh_TW
dc.contributor.author | Hao-En Kuan | en
dc.date.accessioned | 2025-07-16T16:13:25Z | -
dc.date.available | 2025-07-17 | -
dc.date.copyright | 2025-07-16 | -
dc.date.issued | 2025 | -
dc.date.submitted | 2025-07-03 | -
dc.identifier.citation[1] O. Bell, C. Gill, and X. Zhang. Hardware acceleration with zero-copy memory management for heterogeneous computing. In 2023 IEEE 29th International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA), pages 28–37, 2023.

[2] J. Bonwick. The slab allocator: an object-caching kernel memory allocator. In Proceedings of the USENIX Summer 1994 Technical Conference on USENIX Summer 1994 Technical Conference - Volume 1, USTC’94, page 6, USA, 1994. USENIX Association.

[3] A. Corsaro, L. Cominardi, O. Hecart, G. Baldoni, J. E. P. Avital, J. Loudet, C. Guimares, M. Ilyin, and D. Bannov. Zenoh: Unifying communication, storage and computation from the cloud to the microcontroller. In 2023 26th Euromicro Conference on Digital System Design (DSD), pages 422–428, 2023.

[4] Distributed (Deep) Machine Learning Community. dlpack: common in-memory tensor structure. https://github.com/dmlc/dlpack, 2017. Accessed: 2025-04-27.

[5] Eclipse-Cyclonedds. Github - eclipse-cyclonedds/cyclonedds: Eclipse cyclone dds project. https://github.com/eclipse-cyclonedds/cyclonedds, 2019. Accessed: 2025-04-27.

[6] Eclipse-Iceoryx. Github - eclipse-iceoryx/ iceoryx: true zero-copy inter-process-communication. https://github.com/eclipse-iceoryx/iceoryx, 2019. Accessed: 2025-04-27.

[7] Eclipse-Iceoryx. Github - eclipse-iceoryx/iceoryx2: Eclipse iceoryx2tm - true zero-copy inter-process-communication in pure rust. https://github.com/eclipse-iceoryx/iceoryx2, 2023. Accessed: 2025-04-27.

[8] R. Giannessi, A. Biondi, and A. Biasci. RT-Mimalloc: A New Look at Dynamic Memory Allocation for Real-Time Systems. In 2024 IEEE 30th Real-Time and Embedded Technology and Applications Symposium (RTAS), pages 173–185, Los Alamitos, CA, USA, May 2024. IEEE Computer Society.

[9] H. Hua, Y. Li, T. Wang, N. Dong, W. Li, and J. Cao. Edge computing with artificial intelligence: A machine learning perspective. ACM Comput. Surv., 55(9), jan 2023.

[10] W. Jakob. nanobind: tiny and efficient c++/python bindings. https://github.com/wjakob/nanobind, 2022. Accessed: 2025-04-27.

[11] W. Jakob, J. Rhinelander, and D. Moldovan. pybind11: Seamless operability between c++11 and python. https://github.com/pybind/pybind11, 2016. Accessed: 2025-04-27.

[12] G. Jocher, J. Qiu, and A. Chaurasia. Ultralytics YOLO. https://github.com/ultralytics/ultralytics, Jan. 2023. Accessed: 2025-04-27.

[13] M. S. Johnstone and P. R. Wilson. The memory fragmentation problem: solved? SIGPLAN Not., 34(3):26–36, oct 1998.

[14] A. Kanametov. yolo-face: Yolo face in pytorch. https://github.com/akanametov/yolo-face, 2024. Accessed: 2025-04-27.

[15] M. Khasgiwale, V. Sharma, S. Mishra, B. Thadichi, J. John, and R. Khanna. Shimmy: Accelerating inter-container communication for the iot edge. In GLOBECOM 2023 - 2023 IEEE Global Communications Conference, pages 4461–4466, 2023.

[16] D. E. Knuth. The art of computer programming. volume 1: Fundamental algorithms. Journal of the American Statistical Association, 64(325):401, mar 1969.

[17] D. Leijen, B. Zorn, and L. De Moura. Mimalloc: Free list sharding in action. In Programming Languages and Systems: 17th Asian Symposium, APLAS 2019, Nusa Dua, Bali, Indonesia, December 1–4, 2019, Proceedings 17, pages 244–265. Springer, 2019.

[18] W.-Y. Liang, Y. Yuan, and H.-J. Lin. A performance study on the throughput and latency of zenoh, mqtt, kafka, and dds. https://arxiv.org/abs/2303.09419, 2023. Accessed: 2025-04-27.

[19] H. Lin. Embedded artificial intelligence: Intelligence on devices. Computer, 56(09):90–93, sep 2023.

[20] M. Masmano, I. Ripoll, A. Crespo, and J. Real. Tlsf: a new dynamic memory allocator for real-time systems. In Proceedings. 16th Euromicro Conference on Real-Time Systems, 2004. ECRTS 2004., pages 79–88, 2004.

[21] M. Masmano, I. Ripoll, J. Real, A. Crespo, and A. Wellings. Implementation of a constant‐time dynamic storage allocator. Software: Practice and Experience, 38:995 – 1026, 08 2008.

[22] J. M. Mellor-Crummey and M. L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Trans. Comput. Syst., 9(1):21–65, feb 1991.

[23] NVIDIA. Cuda for tegra release 12.8. https://docs.nvidia.com/cuda/pdf/CUDA-For-Tegra-AppNote.pdf, 2025. Please refer to Section 3.6. CUDA Features Not Supported on Tegra. Accessed: 2025-04-27.

[24] F. Oliveira, D. G. Costa, F. Assis, and I. Silva. Internet of intelligent things: A convergence of embedded systems, edge computing and machine learning. Internet of Things, 26:101153, 2024.

[25] Y. Pang, J. Cao, Y. Li, J. Xie, H. Sun, and J. Gong. Tju-dhd: A diverse high-resolution dataset for object detection. IEEE Transactions on Image Processing, 2021.

[26] R. Singh and S. S. Gill. Edge ai: A survey. Internet of Things and Cyber-Physical Systems, 3:71–92, 2023.

[27] H. Wu, J. Jin, J. Zhai, Y. Gong, and W. Liu. Accelerating gpu message communication for autonomous navigation systems. In 2021 IEEE International Conference on Cluster Computing (CLUSTER), pages 181–191, 2021.

[28] J. Zhang, X. Yu, S. Ha, J. P. Queralta, and T. Westerlund. Comparison of dds, mqtt, and zenoh in edge-to-edge and edge-to-cloud communication for distributed ros 2 systems. https://arxiv.org/abs/2309.07496, 2023. Accessed: 2025-04-27.
-
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97774 | -
dc.description.abstract | 即時邊緣人工智慧應用通常需要有效率的 GPU 資料處理與傳輸。由於這類應用通常高度模組化,因此廣泛使用發佈-訂閱模式以在各個元件之間傳遞資料。然而,現有的發佈-訂閱中介軟體在 GPU 與主記憶體之間會產生多餘的記憶體複製,導致顯著的延遲。為了解決這個問題,我們提出了認知 GPU 的發佈-訂閱通訊機制(GPU-Aware Pub/Sub communication,簡稱 GAPS),這是一種通用解決方案,將共享的 CUDA 記憶體與現有的發佈-訂閱中介軟體(如 Zenoh-pico 和 Iceoryx)整合在一起。GAPS 透過讓發佈者與訂閱者共享 GPU 記憶體,來消除不必要的記憶體複製,進而大幅降低資料傳輸延遲。在我們的設計中,我們提出了一個獨立的共享 CUDA 記憶體管理器,會在每個「主題」初始化時,為該「主題」建立一個共享的 CUDA 記憶體池。為了在此記憶體池實現細粒度的記憶體分配,我們修改了一種即時動態記憶體配置器 Two-Level Segregated Fit(TLSF),使其具備多執行緒安全性且能管理 GPU 記憶體。此外,我們還開發了 PyGAPS,一個用於加速發佈 PyTorch 張量的延伸版本,能消除在人工智慧應用中的序列化開銷。根據我們的實驗結果,GAPS 顯著降低端到端延遲,並提升簡化的電腦視覺流程的吞吐量(在影像分割任務中提升最多達 1.5 倍,在分類任務中提升最多達 3.8 倍),是一個適用於即時邊緣人工智慧的穩健解決方案。 | zh_TW
dc.description.abstract | Real-time Edge AI applications often require efficient GPU-based data processing and communication. Since these applications are typically highly modularized, the publish–subscribe (pub/sub) pattern is widely used to deliver data among components. However, existing pub/sub middleware introduces significant latency due to redundant memory copies between GPU and host memory. To address this, we propose GPU-Aware Pub/Sub communication (GAPS), a universal solution that integrates shared CUDA memory with existing pub/sub middleware such as Zenoh-pico and Iceoryx. GAPS minimizes data transfer latency by enabling GPU memory sharing between publishers and subscribers, eliminating unnecessary memory copies. In our work, we propose an independent shared CUDA memory manager that creates a shared CUDA memory pool for each topic during the topic's initialization. For fine-grained allocation from the pool, we modify Two-Level Segregated Fit (TLSF), a real-time dynamic memory allocator, making it process-safe and capable of managing GPU memory. Additionally, we develop PyGAPS, an extension that accelerates publication of PyTorch tensors, eliminating serialization overhead in AI-driven applications. Our evaluation demonstrates that GAPS significantly reduces end-to-end latency and improves the throughput of simplified computer vision pipelines (by up to 1.5× in the segmentation task and 3.8× in the classification task), making it a robust solution for real-time Edge AI. | en
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-07-16T16:13:25Z. No. of bitstreams: 0 | en
dc.description.provenance | Made available in DSpace on 2025-07-16T16:13:25Z (GMT). No. of bitstreams: 0 | en
dc.description.tableofcontents | Verification Letter from the Oral Examination Committee i
Acknowledgements ii
摘要 (Chinese Abstract) iii
Abstract iv
Contents vi
List of Figures ix
List of Tables xi
Chapter 1 Introduction 1
Chapter 2 Background and Related Work 4
2.1 Publish–Subscribe Pattern 4
2.2 Shared CUDA Memory 5
2.3 Real-time Dynamic Memory Allocator 7
2.4 Related Work 8
Chapter 3 Methodology 10
3.1 Shared Metadata Store 11
3.1.1 Topic Header 12
3.1.2 TLSF Section 12
3.1.3 Message Queue Section 13
3.2 Shared CUDA Memory Manager 14
3.3 Publisher and Subscriber Node 15
Chapter 4 Implementation 17
4.1 Header-Detached Process-Safe TLSF 17
4.1.1 Detached Block Headers 18
4.1.2 Critical Section Protection 19
4.2 Publisher/Subscriber Construction 20
4.2.1 Shared Metadata Store Initialization 20
4.2.2 Memory Allocator Construction 22
4.2.3 Pub/Sub Instance Construction 22
4.3 Message Publishing and Handling 23
4.3.1 Address Conversion 25
4.3.2 Payload Lifecycle 25
4.3.3 Quality of Service 26
4.4 Binding with Python 27
Chapter 5 Evaluation 29
5.1 Environment Configuration 29
5.2 End-to-End Latency 29
5.2.1 One-to-One Scenario 34
5.2.2 Many-to-One and One-to-Many Scenarios 37
5.3 Computer Vision Pipeline Throughput 39
Chapter 6 Conclusion 41
References 43
dc.language.iso | en | -
dc.subject | 即時動態記憶體配置器 | zh_TW
dc.subject | 發佈/訂閱中介軟體 | zh_TW
dc.subject | 圖形處理器 | zh_TW
dc.subject | 共享記憶體 | zh_TW
dc.subject | Pub/Sub Middleware | en
dc.subject | GPU | en
dc.subject | Real-Time Dynamic Memory Allocator | en
dc.subject | Shared Memory | en
dc.title | 透過共享 GPU 記憶體的發佈/訂閱中介軟體實現低延遲邊緣人工智慧通訊 | zh_TW
dc.title | Low-Latency Edge AI Communication with Pub/Sub Middleware via GPU Memory Sharing | en
dc.type | Thesis | -
dc.date.schoolyear | 113-2 | -
dc.description.degree | 碩士 (Master's) | -
dc.contributor.oralexamcommittee | 施吉昇;張原豪 | zh_TW
dc.contributor.oralexamcommittee | Chi-Sheng Shih;Yuan-Hao Chang | en
dc.subject.keyword | 發佈/訂閱中介軟體,圖形處理器,共享記憶體,即時動態記憶體配置器 | zh_TW
dc.subject.keyword | Pub/Sub Middleware,GPU,Shared Memory,Real-Time Dynamic Memory Allocator | en
dc.relation.page | 47 | -
dc.identifier.doi | 10.6342/NTU202501397 | -
dc.rights.note | 同意授權(限校園內公開) (authorization granted; access limited to campus) | -
dc.date.accepted | 2025-07-04 | -
dc.contributor.author-college | 電機資訊學院 (College of Electrical Engineering and Computer Science) | -
dc.contributor.author-dept | 資訊工程學系 (Department of Computer Science and Information Engineering) | -
dc.date.embargo-lift | 2025-07-17 | -
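Note on the mechanism described in the abstract: GAPS shares a per-topic CUDA memory pool between publisher and subscriber processes so that payloads never take a detour through host memory. The C++ sketch below is purely illustrative and is not taken from the thesis; it assumes a platform where the CUDA IPC runtime API is available and uses hypothetical helper names to show one conventional way such a pool could be exported by a publisher, distributed as a small handle through the pub/sub middleware, and mapped by subscribers.

// Illustrative sketch only (hypothetical helper names, not the GAPS code):
// a publisher exports a per-topic CUDA memory pool and a subscriber maps it,
// so published messages can carry (offset, length) descriptors into the pool
// instead of copied payloads.
#include <cuda_runtime.h>
#include <cstddef>

// Publisher side: allocate the topic's GPU pool once and export an IPC handle.
// The handle is small (64 bytes) and can be sent through the middleware as
// ordinary topic metadata.
cudaError_t publisher_create_pool(std::size_t pool_bytes,
                                  void **pool_out,
                                  cudaIpcMemHandle_t *handle_out) {
    cudaError_t err = cudaMalloc(pool_out, pool_bytes);   // per-topic GPU pool
    if (err != cudaSuccess) return err;
    return cudaIpcGetMemHandle(handle_out, *pool_out);    // exportable handle
}

// Subscriber side: map the publisher's pool into this process's address space.
cudaError_t subscriber_open_pool(cudaIpcMemHandle_t handle, void **pool_out) {
    return cudaIpcOpenMemHandle(pool_out, handle,
                                cudaIpcMemLazyEnablePeerAccess);
}

// Teardown: subscribers unmap the pool; the publisher frees the allocation
// when the topic is destroyed.
void subscriber_close_pool(void *pool)  { cudaIpcCloseMemHandle(pool); }
void publisher_destroy_pool(void *pool) { cudaFree(pool); }

Fine-grained allocation within the mapped pool (the thesis uses a process-safe variant of TLSF) and the PyGAPS path for PyTorch tensors are not shown; a Python binding could, for example, expose a pool region to PyTorch via the DLPack protocol (torch.utils.dlpack.from_dlpack) to avoid serialization, though the thesis's actual approach may differ.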
Appears in Collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in This Item:
File | Size | Format
ntu-113-2.pdf | 2.05 MB | Adobe PDF
Access is restricted to NTU campus IP addresses (off-campus users should use the library's VPN service).

