NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97774
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | 洪士灝 | zh_TW
dc.contributor.advisor | Shih-Hao Hung | en
dc.contributor.author | 官澔恩 | zh_TW
dc.contributor.author | Hao-En Kuan | en
dc.date.accessioned | 2025-07-16T16:13:25Z | -
dc.date.available | 2025-07-17 | -
dc.date.copyright | 2025-07-16 | -
dc.date.issued | 2025 | -
dc.date.submitted | 2025-07-03 | -
dc.identifier.citation[1] O. Bell, C. Gill, and X. Zhang. Hardware acceleration with zero-copy memory management for heterogeneous computing. In 2023 IEEE 29th International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA), pages 28–37, 2023.

[2] J. Bonwick. The slab allocator: an object-caching kernel memory allocator. In Proceedings of the USENIX Summer 1994 Technical Conference on USENIX Summer 1994 Technical Conference - Volume 1, USTC’94, page 6, USA, 1994. USENIX Association.

[3] A. Corsaro, L. Cominardi, O. Hecart, G. Baldoni, J. E. P. Avital, J. Loudet, C. Guimares, M. Ilyin, and D. Bannov. Zenoh: Unifying communication, storage and computation from the cloud to the microcontroller. In 2023 26th Euromicro Conference on Digital System Design (DSD), pages 422–428, 2023.

[4] Distributed (Deep) Machine Learning Community. dlpack: common in-memory tensor structure. https://github.com/dmlc/dlpack, 2017. Accessed: 2025-04-27.

[5] Eclipse-Cyclonedds. Github - eclipse-cyclonedds/cyclonedds: Eclipse cyclone dds project. https://github.com/eclipse-cyclonedds/cyclonedds, 2019. Accessed: 2025-04-27.

[6] Eclipse-Iceoryx. Github - eclipse-iceoryx/ iceoryx: true zero-copy inter-process-communication. https://github.com/eclipse-iceoryx/iceoryx, 2019. Accessed: 2025-04-27.

[7] Eclipse-Iceoryx. Github - eclipse-iceoryx/iceoryx2: Eclipse iceoryx2tm - true zero-copy inter-process-communication in pure rust. https://github.com/eclipse-iceoryx/iceoryx2, 2023. Accessed: 2025-04-27.

[8] R. Giannessi, A. Biondi, and A. Biasci. RT-Mimalloc: A New Look at Dynamic Memory Allocation for Real-Time Systems. In 2024 IEEE 30th Real-Time and Embedded Technology and Applications Symposium (RTAS), pages 173–185, Los Alamitos, CA, USA, May 2024. IEEE Computer Society.

[9] H. Hua, Y. Li, T. Wang, N. Dong, W. Li, and J. Cao. Edge computing with artificial intelligence: A machine learning perspective. ACM Comput. Surv., 55(9), jan 2023.

[10] W. Jakob. nanobind: tiny and efficient c++/python bindings. https://github.com/wjakob/nanobind, 2022. Accessed: 2025-04-27.

[11] W. Jakob, J. Rhinelander, and D. Moldovan. pybind11: Seamless operability between c++11 and python. https://github.com/pybind/pybind11, 2016. Accessed: 2025-04-27.

[12] G. Jocher, J. Qiu, and A. Chaurasia. Ultralytics YOLO. https://github.com/ultralytics/ultralytics, Jan. 2023. Accessed: 2025-04-27.

[13] M. S. Johnstone and P. R. Wilson. The memory fragmentation problem: solved? SIGPLAN Not., 34(3):26–36, oct 1998.

[14] A. Kanametov. yolo-face: Yolo face in pytorch. https://github.com/akanametov/yolo-face, 2024. Accessed: 2025-04-27.

[15] M. Khasgiwale, V. Sharma, S. Mishra, B. Thadichi, J. John, and R. Khanna. Shimmy: Accelerating inter-container communication for the iot edge. In GLOBECOM 2023 - 2023 IEEE Global Communications Conference, pages 4461–4466, 2023.

[16] D. E. Knuth. The art of computer programming. volume 1: Fundamental algorithms. Journal of the American Statistical Association, 64(325):401, mar 1969.

[17] D. Leijen, B. Zorn, and L. De Moura. Mimalloc: Free list sharding in action. In Programming Languages and Systems: 17th Asian Symposium, APLAS 2019, Nusa Dua, Bali, Indonesia, December 1–4, 2019, Proceedings 17, pages 244–265. Springer, 2019.

[18] W.-Y. Liang, Y. Yuan, and H.-J. Lin. A performance study on the throughput and latency of zenoh, mqtt, kafka, and dds. https://arxiv.org/abs/2303.09419, 2023. Accessed: 2025-04-27.

[19] H. Lin. Embedded artificial intelligence: Intelligence on devices. Computer, 56(09):90–93, sep 2023.

[20] M. Masmano, I. Ripoll, A. Crespo, and J. Real. Tlsf: a new dynamic memory allocator for real-time systems. In Proceedings. 16th Euromicro Conference on Real-Time Systems, 2004. ECRTS 2004., pages 79–88, 2004.

[21] M. Masmano, I. Ripoll, J. Real, A. Crespo, and A. Wellings. Implementation of a constant‐time dynamic storage allocator. Software: Practice and Experience, 38:995 – 1026, 08 2008.

[22] J. M. Mellor-Crummey and M. L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Trans. Comput. Syst., 9(1):21–65, feb 1991.

[23] NVIDIA. Cuda for tegra release 12.8. https://docs.nvidia.com/cuda/pdf/CUDA-For-Tegra-AppNote.pdf, 2025. Please refer to Section 3.6. CUDA Features Not Supported on Tegra. Accessed: 2025-04-27.

[24] F. Oliveira, D. G. Costa, F. Assis, and I. Silva. Internet of intelligent things: A convergence of embedded systems, edge computing and machine learning. Internet of Things, 26:101153, 2024.

[25] Y. Pang, J. Cao, Y. Li, J. Xie, H. Sun, and J. Gong. Tju-dhd: A diverse high-resolution dataset for object detection. IEEE Transactions on Image Processing, 2021.

[26] R. Singh and S. S. Gill. Edge ai: A survey. Internet of Things and Cyber-Physical Systems, 3:71–92, 2023.

[27] H. Wu, J. Jin, J. Zhai, Y. Gong, and W. Liu. Accelerating gpu message communication for autonomous navigation systems. In 2021 IEEE International Conference on Cluster Computing (CLUSTER), pages 181–191, 2021.

[28] J. Zhang, X. Yu, S. Ha, J. P. Queralta, and T. Westerlund. Comparison of dds, mqtt, and zenoh in edge-to-edge and edge-to-cloud communication for distributed ros 2 systems. https://arxiv.org/abs/2309.07496, 2023. Accessed: 2025-04-27.
-
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97774 | -
dc.description.abstract | 即時邊緣人工智慧應用通常需要有效率的 GPU 資料處理與傳輸。由於這類應用通常高度模組化,因此廣泛使用發佈-訂閱模式以在各個元件之間傳遞資料。然而,現有的發佈-訂閱中介軟體在 GPU 與主記憶體之間會產生多餘的記憶體複製,導致顯著的延遲。為了解決這個問題,我們提出了認知 GPU 的發佈-訂閱通訊機制(GPU-Aware Pub/Sub communication,簡稱 GAPS),這是一種通用解決方案,將共享的 CUDA 記憶體與現有的發佈-訂閱中介軟體(如 Zenoh-pico 和 Iceoryx)整合在一起。GAPS 透過讓發佈者與訂閱者共享 GPU 記憶體,來消除不必要的記憶體複製,進而大幅降低資料傳輸延遲。在我們的設計中,我們提出了一個獨立的共享 CUDA 記憶體管理器,會在每個「主題」初始化時,為該「主題」建立一個共享的 CUDA 記憶體池。為了在此記憶體池實現細粒度的記憶體分配,我們修改了一種即時動態記憶體配置器 Two-Level Segregated Fit(TLSF),使其具備多執行緒安全性且能管理 GPU 記憶體。此外,我們還開發了 PyGAPS,一個用於加速發佈 PyTorch 張量的延伸版本,能消除在人工智慧應用中的序列化開銷。根據我們的實驗結果,GAPS 顯著降低端到端延遲,並提升簡化的電腦視覺流程的吞吐量(在影像分割任務中提升最多達 1.5 倍,在分類任務中提升最多達 3.8 倍),是一個適用於即時邊緣人工智慧的穩健解決方案。 | zh_TW
dc.description.abstract | Real-time Edge AI applications often require efficient GPU-based data processing and communication. Since these applications are typically highly modularized, the publish–subscribe (pub/sub) pattern is widely used to deliver data among components. However, existing pub/sub middleware introduces significant latency due to redundant memory copies between GPU and host memory. To address this, we propose GPU-Aware Pub/Sub communication (GAPS), a universal solution that integrates shared CUDA memory with existing pub/sub middleware such as Zenoh-pico and Iceoryx. GAPS minimizes data transfer latency by enabling GPU memory sharing between publishers and subscribers, eliminating unnecessary memory copies. In our work, we propose an independent shared CUDA memory manager that creates a shared CUDA memory pool for each topic during the topic's initialization. For fine-grained allocation from the pool, we modify Two-Level Segregated Fit (TLSF), a real-time dynamic memory allocator, making it process-safe and capable of managing GPU memory. Additionally, we develop PyGAPS, an extension that accelerates publication of PyTorch tensors, eliminating serialization overhead in AI-driven applications. Our evaluation demonstrates that GAPS significantly reduces end-to-end latency and improves the throughput of simplified computer vision pipelines (by up to 1.5× in the segmentation task and 3.8× in the classification task), making it a robust solution for real-time Edge AI. | en
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-07-16T16:13:25Z. No. of bitstreams: 0 | en
dc.description.provenance | Made available in DSpace on 2025-07-16T16:13:25Z (GMT). No. of bitstreams: 0 | en
dc.description.tableofcontents | Verification Letter from the Oral Examination Committee i
Acknowledgements ii
摘要 (Chinese Abstract) iii
Abstract iv
Contents vi
List of Figures ix
List of Tables xi
Chapter 1 Introduction 1
Chapter 2 Background and Related Work 4
2.1 Publish–Subscribe Pattern 4
2.2 Shared CUDA Memory 5
2.3 Real-time Dynamic Memory Allocator 7
2.4 Related Work 8
Chapter 3 Methodology 10
3.1 Shared Metadata Store 11
3.1.1 Topic Header 12
3.1.2 TLSF Section 12
3.1.3 Message Queue Section 13
3.2 Shared CUDA Memory Manager 14
3.3 Publisher and Subscriber Node 15
Chapter 4 Implementation 17
4.1 Header-Detached Process-Safe TLSF 17
4.1.1 Detached Block Headers 18
4.1.2 Critical Section Protection 19
4.2 Publisher/Subscriber Construction 20
4.2.1 Shared Metadata Store Initialization 20
4.2.2 Memory Allocator Construction 22
4.2.3 Pub/Sub Instance Construction 22
4.3 Message Publishing and Handling 23
4.3.1 Address Conversion 25
4.3.2 Payload Lifecycle 25
4.3.3 Quality of Service 26
4.4 Binding with Python 27
Chapter 5 Evaluation 29
5.1 Environment Configuration 29
5.2 End-to-End Latency 29
5.2.1 One-to-One Scenario 34
5.2.2 Many-to-One and One-to-Many Scenarios 37
5.3 Computer Vision Pipeline Throughput 39
Chapter 6 Conclusion 41
References 43
dc.language.iso | en | -
dc.subject | 即時動態記憶體配置器 | zh_TW
dc.subject | 發佈/訂閱中介軟體 | zh_TW
dc.subject | 圖形處理器 | zh_TW
dc.subject | 共享記憶體 | zh_TW
dc.subject | Pub/Sub Middleware | en
dc.subject | GPU | en
dc.subject | Real-Time Dynamic Memory Allocator | en
dc.subject | Shared Memory | en
dc.title | 透過共享 GPU 記憶體的發佈/訂閱中介軟體實現低延遲邊緣人工智慧通訊 | zh_TW
dc.title | Low-Latency Edge AI Communication with Pub/Sub Middleware via GPU Memory Sharing | en
dc.type | Thesis | -
dc.date.schoolyear | 113-2 | -
dc.description.degree | 碩士 (Master's) | -
dc.contributor.oralexamcommittee | 施吉昇;張原豪 | zh_TW
dc.contributor.oralexamcommittee | Chi-Sheng Shih;Yuan-Hao Chang | en
dc.subject.keyword | 發佈/訂閱中介軟體,圖形處理器,共享記憶體,即時動態記憶體配置器 | zh_TW
dc.subject.keyword | Pub/Sub Middleware,GPU,Shared Memory,Real-Time Dynamic Memory Allocator | en
dc.relation.page | 47 | -
dc.identifier.doi | 10.6342/NTU202501397 | -
dc.rights.note | 同意授權(限校園內公開) (authorization granted; access limited to campus) | -
dc.date.accepted | 2025-07-04 | -
dc.contributor.author-college | 電機資訊學院 (College of Electrical Engineering and Computer Science) | -
dc.contributor.author-dept | 資訊工程學系 (Department of Computer Science and Information Engineering) | -
dc.date.embargo-lift | 2025-07-17 | -
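Note on the mechanism described in the abstract: GAPS shares a per-topic CUDA memory pool between publisher and subscriber processes so that payloads never take a detour through host memory. The C++ sketch below is purely illustrative and is not taken from the thesis; it assumes a platform where the CUDA IPC runtime API is available and uses hypothetical helper names to show one conventional way such a pool could be exported by a publisher, distributed as a small handle through the pub/sub middleware, and mapped by subscribers.

// Illustrative sketch only (hypothetical helper names, not the GAPS code):
// a publisher exports a per-topic CUDA memory pool and a subscriber maps it,
// so published messages can carry (offset, length) descriptors into the pool
// instead of copied payloads.
#include <cuda_runtime.h>
#include <cstddef>

// Publisher side: allocate the topic's GPU pool once and export an IPC handle.
// The handle is small (64 bytes) and can be sent through the middleware as
// ordinary topic metadata.
cudaError_t publisher_create_pool(std::size_t pool_bytes,
                                  void **pool_out,
                                  cudaIpcMemHandle_t *handle_out) {
    cudaError_t err = cudaMalloc(pool_out, pool_bytes);   // per-topic GPU pool
    if (err != cudaSuccess) return err;
    return cudaIpcGetMemHandle(handle_out, *pool_out);    // exportable handle
}

// Subscriber side: map the publisher's pool into this process's address space.
cudaError_t subscriber_open_pool(cudaIpcMemHandle_t handle, void **pool_out) {
    return cudaIpcOpenMemHandle(pool_out, handle,
                                cudaIpcMemLazyEnablePeerAccess);
}

// Teardown: subscribers unmap the pool; the publisher frees the allocation
// when the topic is destroyed.
void subscriber_close_pool(void *pool)  { cudaIpcCloseMemHandle(pool); }
void publisher_destroy_pool(void *pool) { cudaFree(pool); }

Fine-grained allocation within the mapped pool (the thesis uses a process-safe variant of TLSF) and the PyGAPS path for PyTorch tensors are not shown; a Python binding could, for example, expose a pool region to PyTorch via the DLPack protocol (torch.utils.dlpack.from_dlpack) to avoid serialization, though the thesis's actual approach may differ.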
Appears in Collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in This Item:
File | Size | Format
ntu-113-2.pdf | 2.05 MB | Adobe PDF
Access is restricted to NTU campus IP addresses (off-campus users should use the library's VPN service).

