Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98394

| Title: | Deploying YOLO Models on HoloLens 2: Balancing Performance and Efficiency in AR Applications |
| Author: | Pei-Chun Chen (陳沛君) |
| Advisor: | Ying-Yin Huang (黃瀅瑛) |
| Keywords: | HoloLens 2, Augmented Reality (AR), Nano-scale YOLO, On-device Inference, Real-time Object Detection |
| Publication Year: | 2025 |
| Degree: | Master's |
| Abstract: | This study investigates the performance and efficiency balance of four nano-scale YOLO models (YOLOv8n, YOLOv10n, YOLOv11n, and YOLOv12n) deployed on Microsoft HoloLens 2 for real-time object detection in augmented reality (AR) scenarios. With the increasing integration of AR and deep learning technologies, head-mounted devices such as HoloLens 2 are gaining prominence. However, the limited computational resources of such edge devices present significant challenges for deploying effective object detection models. This research addresses the issue by evaluating the frame rate (frames per second, FPS), latency, and detection confidence of four lightweight YOLO models, using a LEGO® assembly scenario for practical testing. Experimental results reveal notable differences in performance among the four models under identical conditions (input resolution: 160×160 pixels, inference via Unity Sentis). YOLOv11n achieved the best overall performance, with approximately 10–11 FPS, an end-to-end latency of about 90–100 ms, and an average detection confidence of 0.935. This indicates that YOLOv11n effectively meets the requirements of most real-time interactions, offering a smooth and reliable user experience. YOLOv8n and YOLOv12n ranked second, with slightly lower performance than YOLOv11n. Specifically, YOLOv8n achieved a frame rate of about 9–10 FPS, a latency of around 100–111 ms, and a confidence score of 0.912; YOLOv12n had a frame rate of approximately 8–10 FPS and higher latency at 111–143 ms, but a slightly better confidence score (0.919) than YOLOv8n. In contrast, YOLOv10n lagged significantly in all three metrics, delivering around 5 FPS, a latency of 200–250 ms, and the lowest detection confidence (0.887), making it unsuitable for real-time applications; it is recommended only for scenarios with minimal real-time demands. Furthermore, detailed analysis revealed a noticeable confidence reduction in certain LEGO® assembly steps (particularly Steps 5 and 6) across all models, likely caused by less distinguishable visual features of the LEGO® parts in these steps. Future work could address this by enhancing dataset diversity and incorporating multi-scale feature fusion techniques.
Additionally, this study validated Unity Sentis as a feasible on-device inference framework (a minimal illustrative sketch of such an inference loop follows the metadata table below). All models ran successfully on this platform without memory overflow or system crashes, demonstrating sufficient stability and compatibility for wearable AR devices. In conclusion, this research confirms the practicality of deploying nano-scale YOLO models on resource-constrained AR head-mounted devices, highlighting YOLOv11n as the best model for balancing real-time detection performance and stability. The findings provide clear guidance for model selection in future AR applications and offer recommendations regarding dataset diversity, power management, and user experience, aiming to further improve on-device inference performance and overall user interaction. |
| URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98394 |
| DOI: | 10.6342/NTU202502914 |
| Full-text License: | Authorized (open access worldwide) |
| Electronic Full-text Release Date: | 2025-08-06 |
| Appears in Collections: | Department of Mechanical Engineering |
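
For readers who want a concrete picture of the deployment pipeline described in the abstract, the following is a minimal sketch of an on-device inference loop with Unity Sentis at a fixed 160×160 input. It is an illustration under stated assumptions, not the thesis's actual code: it assumes the Sentis 1.x C# API (`WorkerFactory`, `TensorFloat`, `TextureConverter`); the class name `YoloSentisRunner`, the `modelAsset` field, and the `DetectOnce` method are hypothetical, and YOLO box decoding plus non-maximum suppression are omitted.

```csharp
using Unity.Sentis;
using UnityEngine;

// Minimal sketch of on-device YOLO inference with Unity Sentis.
// Assumes the Sentis 1.x API; the class, field, and method names
// here are illustrative, not taken from the thesis.
public class YoloSentisRunner : MonoBehaviour
{
    public ModelAsset modelAsset;   // nano-scale YOLO model exported to ONNX
    IWorker worker;

    void Start()
    {
        var model = ModelLoader.Load(modelAsset);
        // GPUCompute backend; HoloLens 2 supports compute shaders.
        worker = WorkerFactory.CreateWorker(BackendType.GPUCompute, model);
    }

    // Runs one camera frame through the model and returns the
    // end-to-end latency in milliseconds.
    public float DetectOnce(Texture frame)
    {
        float t0 = Time.realtimeSinceStartup;

        // Resize/convert the frame to a 160x160 RGB float tensor.
        using TensorFloat input =
            TextureConverter.ToTensor(frame, width: 160, height: 160, channels: 3);
        worker.Execute(input);

        // Force readback so timing covers the full inference,
        // not just scheduling of the GPU work.
        var output = worker.PeekOutput() as TensorFloat;
        output.MakeReadable();   // Sentis 1.3 name; later versions rename this call

        // Decoding of YOLO boxes/confidences from `output` is omitted here.
        return (Time.realtimeSinceStartup - t0) * 1000f;
    }

    void OnDestroy() => worker?.Dispose();
}
```

Averaging the latency returned by `DetectOnce` over many frames, and counting frames completed per second, would reproduce the kind of FPS and end-to-end latency measurements the abstract reports; detection confidence would be read from the decoded model output.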
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-113-2.pdf | 4.18 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
