Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101746

Full metadata record
| DC 欄位 | 值 | 語言 |
|---|---|---|
| dc.contributor.advisor | 陳宏銘 | zh_TW |
| dc.contributor.advisor | Homer H. Chen | en |
| dc.contributor.author | 薛仁豪 | zh_TW |
| dc.contributor.author | Ren-Hau Shiue | en |
| dc.date.accessioned | 2026-03-04T16:14:36Z | - |
| dc.date.available | 2026-03-05 | - |
| dc.date.copyright | 2026-03-04 | - |
| dc.date.issued | 2026 | - |
| dc.date.submitted | 2026-02-25 | - |
| dc.identifier.citation | [1] A. L. Duwaer and G. Van Den Brink, "What is the diplopia threshold?" Perception & Psychophysics, vol. 29, no. 4, pp. 295–309, 1981.
[2] D. A. B. Rahardjo and H. H. Chen, "Cloud-Based Face Recognition for Augmented Reality Glasses," IEEE International Symposium on Mixed and Augmented Reality Adjunct, pp. 685–688, 2023.
[3] J. Liao, Y. Xu, Y. Guan, and G. Liu, "An Augmented Classroom Teaching System based on AR and Facial Recognition," The Journal of Applied Instructional Design, vol. 13, no. 2, pp. 11–16, 2024.
[4] M. Łysakowski, K. Żywanowski, A. Banaszczyk, M. R. Nowicki, P. Skrzypczyński, and S. K. Tadeja, "Real-time onboard object detection for augmented reality: Enhancing head-mounted display with YOLOv8," IEEE International Conference on Edge Computing and Communications, pp. 364–371, 2023.
[5] C. Lugaresi, J. Tang, H. Nash, C. McClanahan, E. Uboweja, M. Hays, F. Zhang, C.-L. Chang, M. G. Yong, J. Lee, W.-T. Chang, W. Hua, M. Georg, and M. Grundmann, "MediaPipe: A Framework for Building Perception Pipelines," Proceedings of the Third Workshop on Computer Vision for AR/VR at IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[6] A. Farasin, F. Peciarolo, M. Grangetto, E. Gianaria, and P. Garza, "Real-time Object Detection and Tracking in Mixed Reality using Microsoft HoloLens," Proceedings of the 15th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, vol. 4, pp. 165–172, 2020.
[7] S. Avidan and A. Shashua, "Trajectory Triangulation: 3D Reconstruction of Moving Points from a Monocular Image Sequence," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 4, pp. 348–357, 2000.
[8] H. S. Park, T. Shiratori, I. Matthews, and Y. Sheikh, "3D Reconstruction of a Moving Point from a Series of 2D Projections," European Conference on Computer Vision, vol. 6313, pp. 158–171, 2010.
[9] L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao, "Depth Anything V2," Advances in Neural Information Processing Systems, vol. 37, pp. 21875–21911, 2024.
[10] B. Mandal, S.-C. Chia, L. Li, V. Chandrasekhar, C. Tan, and J.-H. Lim, "A Wearable Face Recognition System on Google Glass for Assisting Social Interactions," Asian Conference on Computer Vision Workshops, vol. 9010, pp. 419–433, 2015.
[11] O. Daescu, H. Huang, and M. Weinzierl, "Deep learning based face recognition system with smart glasses," Proceedings of the 12th ACM International Conference on Pervasive Technologies Related to Assistive Environments, pp. 218–226, 2019.
[12] C. McKelvey, R. Dreyer, D. Zhu, W. Wang, and J. Quarles, "Energy-Oriented Designs of an Augmented-Reality Application on a Vuzix Blade Smart Glass," Proceedings of the IEEE Tenth International Green and Sustainable Computing Conference, pp. 1–8, 2019.
[13] C. Tomasi and T. Kanade, "Shape and motion from image streams under orthography," International Journal of Computer Vision, vol. 9, no. 2, pp. 137–154, 1992.
[14] R. Smith, M. Self, and P. Cheeseman, "A Stochastic Map for Uncertain Spatial Relationships," Proceedings of the Fourth International Symposium on Robotics Research, pp. 467–474, 1988.
[15] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow, "Digging into Self-Supervised Monocular Depth Estimation," Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3828–3838, 2019.
[16] Y. Long, H. Yu, and B. Liu, "Depth completion towards different sensor configurations via relative depth map estimation and scale recovery," Journal of Visual Communication and Image Representation, vol. 80, art. no. 103272, 2021.
[17] J. Bromley, J. W. Bentz, L. Bottou, I. Guyon, Y. LeCun, C. Moore, E. Säckinger, and R. Shah, "Signature verification using a Siamese time delay neural network," International Journal of Pattern Recognition and Artificial Intelligence, vol. 7, no. 4, pp. 669–688, 1993.
[18] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr, "Fully-convolutional Siamese networks for object tracking," European Conference on Computer Vision Workshops, vol. 9914, pp. 850–865, 2016.
[19] V. Borsuk, R. Vei, O. Kupyn, T. Martyniuk, I. Krashenyi, and J. Matas, "FEAR: Fast, Efficient, Accurate and Robust Visual Tracker," European Conference on Computer Vision, Lecture Notes in Computer Science, vol. 13682, pp. 644–663, 2022.
[20] G. Barquero, C. Fernández Tena, and I. Hupont, "Long-Term Face Tracking for Crowded Video-Surveillance Scenarios," IEEE International Joint Conference on Biometrics, pp. 42–49, 2020.
[21] Y. Wong, S. Chen, S. Mau, C. Sanderson, and B. C. Lovell, "Patch-based Probabilistic Image Quality Assessment for Face Selection and Improved Video-based Face Recognition," IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 81–88, 2011.
[22] Y. Lin, S. Cheng, J. Shen, and M. Pantic, "MobiFace: A Novel Dataset for Mobile Face Tracking in the Wild," IEEE International Conference on Automatic Face and Gesture Recognition, pp. 1–8, 2019.
[23] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single Shot MultiBox Detector," European Conference on Computer Vision, Lecture Notes in Computer Science, vol. 9905, pp. 21–37, 2016.
[24] Z. He, J. Zhang, M. Kan, S. Shan, and X. Chen, "Robust FEC-CNN: A High Accuracy Facial Landmark Detection System," IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2044–2050, 2017.
[25] E. O. Ewunonu and C. I. P. Anibeze, "Anthropometric Study of the Facial Morphology in a South-Eastern Nigerian Population," Human Biology Review, vol. 2, no. 4, pp. 314–323, 2013.
[26] S. Wu, M. Kan, Z. He, S. Shan, and X. Chen, "Funnel-Structured Cascade for Multi-View Face Detection with Alignment-Awareness," Neurocomputing, vol. 221, pp. 138–145, 2017.
[27] M. Kim, A. K. Jain, and X. Liu, "AdaFace: Quality Adaptive Margin for Face Recognition," IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18750–18759, 2022.
[28] T. Qin, P. Li, and S. Shen, "VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator," IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004–1020, 2018.
[29] S. Chen, H. Guo, S. Zhu, F. Zhang, Z. Huang, J. Feng, and B. Kang, "Video Depth Anything: Consistent Depth Estimation for Super-Long Videos," IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22831–22840, 2025. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101746 | - |
| dc.description.abstract | 具備人臉辨識能力的擴增實境眼鏡,能在視野中的人物上疊加虛擬姓名標籤,使使用者在不打斷眼神接觸的情況下辨識他人。然而,在輕量化擴增實境眼鏡上實現可靠的標籤擺放仍具挑戰,主因包括雲端往返延遲、眼鏡端運算資源受限,以及缺乏深度感測器。延遲會使標籤落後於移動的目標;而當標籤擺放深度與真實目標深度不一致時,使用者注視目標時可能產生複視現象。本論文提出一套眼鏡端與伺服器端協作的人臉標記系統,能在眼鏡端提供即時的人臉定位,並在不依賴深度感測器的前提下,估測移動目標的絕對尺度深度。在人臉定位方面,我們設計了一個基於孿生網路的拆分式定位流程,將計算分流至眼鏡端與伺服器端:伺服器端週期性更新目標模板,眼鏡端僅執行輕量的搜尋分支,以達成即時定位。在深度估計方面,我們結合視覺慣性同步定位與建圖(SLAM)系統與單目深度估計網路,利用 SLAM 提供的稀疏絕對深度樣本,將網路輸出的相對深度圖進行尺度對齊,得到具有絕對尺度的深度圖,並由此估計目標深度。在三個公開資料集、延遲感知的評估協定下,本系統的人臉定位在 93.11% 的影格中可達到 IoU 大於 0.5,優於眼鏡端人臉偵測的 78.08% 與伺服器端人臉偵測的 53.90%。此外,在四個具代表性的社交情境中,我們的深度估計在 96.15% 的影格中,能將深度誤差維持在複視閾值以下,優於人臉寬度先驗基線的 85.40% 與 Depth Anything V2 的 78.11%。 | zh_TW |
| dc.description.abstract | Augmented reality (AR) glasses with face recognition can overlay virtual name labels on people in view, allowing users to identify others without breaking eye contact. However, reliable label placement on lightweight AR glasses remains challenging due to round-trip latency, limited on-glasses computing resources, and the lack of a depth sensor. Latency causes the rendered label to lag behind a moving target, while a depth mismatch can induce double vision when the user fixates on the target. We present a glasses–server collaborative face tagging system that provides real-time on-glasses face localization and metric depth estimation for moving targets without requiring a depth sensor. For face localization, we design a split Siamese pipeline that divides computation between the glasses and the server: the server periodically updates the target face template, while the glasses execute only the lightweight search branch. For depth estimation, we combine visual–inertial SLAM with a monocular depth estimation network, using sparse metric depth samples from SLAM to affinely align the network's relative depth map to metric scale; the resulting metric depth map yields the target face depth. Evaluated under a latency-aware protocol on three public datasets, our face localization achieves an IoU above 0.5 in 93.25% of frames on average, outperforming on-glasses face detection at 77.38% and server-side face detection at 53.99%. In addition, averaged across four representative social scenarios, our depth estimation keeps the depth error below the diplopia threshold in 96.15% of frames, surpassing the Face-Width Prior baseline at 85.40% and Depth Anything V2 at 78.11%. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2026-03-04T16:14:36Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2026-03-04T16:14:36Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | 誌謝 i
中文摘要 ii
ABSTRACT iii
CONTENTS iv
LIST OF FIGURES vi
LIST OF TABLES viii
Chapter 1 Introduction 1
Chapter 2 Related Work 7
2.1 Face Recognition on Smart Glasses and AR Glasses 7
2.2 Monocular Depth Estimation 8
2.3 Siamese Networks for Similarity Learning 10
Chapter 3 Proposed System 12
3.1 Problem Formulation 12
3.2 System Architecture 13
3.3 Face Localization Pipeline 15
3.3.1 Server-Assisted Template Initialization and Update 15
3.3.2 Search Region Selection 17
3.3.3 Similarity Map from Split Siamese Matching 17
3.3.4 Deriving the Label Anchor from Facial Landmarks 18
3.4 Depth Estimation Pipeline 20
3.4.1 Sparse Metric Depth Samples from Visual-Inertial SLAM 20
3.4.2 Relative Depth Map from Monocular Depth Estimation 22
3.4.3 Affine Alignment to Metric Depth Samples 22
3.5 Delay Compensation 23
Chapter 4 Experiments 25
4.1 Evaluation of Face Localization 25
4.1.1 Datasets 25
4.1.2 Metrics and Latency-Aware Evaluation Protocol 26
4.1.3 Experimental Setup and Hardware Platforms 27
4.1.4 Localization Accuracy across Various Compute Power Levels 28
4.1.5 Localization Accuracy under Various Network Latency Settings 31
4.2 Evaluation of Depth Estimation 34
4.2.1 Data Collection 34
4.2.2 Perceptually Motivated Metrics and Evaluation Protocol 34
4.2.3 Compared Methods and Implementation Details 36
4.2.4 Depth Estimation Results 38
4.3 System Implementation and Runtime Evaluation 40
4.3.1 Implementation Details 40
4.3.2 Power, Battery, and Thermal Characteristics 42
Chapter 5 Discussion and Future Work 45
Chapter 6 Conclusion 46
REFERENCES 47 | - |
| dc.language.iso | en | - |
| dc.subject | 擴增實境 | - |
| dc.subject | 人臉標記 | - |
| dc.subject | 雲端與邊緣運算 | - |
| dc.subject | 孿生網路 | - |
| dc.subject | 深度估計 | - |
| dc.subject | Augmented reality | - |
| dc.subject | Face tagging | - |
| dc.subject | Cloud and edge computing | - |
| dc.subject | Siamese network | - |
| dc.subject | Depth estimation | - |
| dc.title | 在雲端往返延遲與運算資源限制下之 AR 眼鏡移動人臉標記 | zh_TW |
| dc.title | Tagging Moving Faces for AR Glasses under Round-Trip Latency and Computing Resource Constraints | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 114-1 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.oralexamcommittee | 鍾國亮;林澤;李佩君;施光祖 | zh_TW |
| dc.contributor.oralexamcommittee | Kuo-Liang Chung;Che Lin;Pei-Jun Lee;Kuang-Tsu Shih | en |
| dc.subject.keyword | 擴增實境,人臉標記,雲端與邊緣運算,孿生網路,深度估計 | zh_TW |
| dc.subject.keyword | Augmented reality,Face tagging,Cloud and edge computing,Siamese network,Depth estimation | en |
| dc.relation.page | 50 | - |
| dc.identifier.doi | 10.6342/NTU202600774 | - |
| dc.rights.note | 同意授權(全球公開) | - |
| dc.date.accepted | 2026-02-25 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 電信工程學研究所 | - |
| dc.date.embargo-lift | 2026-03-05 | - |
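The abstract describes the depth pipeline as affinely aligning a monocular network's relative depth map to metric scale using sparse metric samples from visual-inertial SLAM. A minimal least-squares sketch of that alignment step is shown below; the function and parameter names (`align_relative_depth`, `sample_uv`, `sample_z`) are illustrative placeholders, not identifiers from the thesis, and the sketch assumes the relative and metric depths are related by a single global scale and shift.

```python
import numpy as np

def align_relative_depth(rel_depth, sample_uv, sample_z):
    """Fit d_metric ~ a * d_rel + b by least squares on sparse samples.

    rel_depth : (H, W) relative depth map from a monocular network.
    sample_uv : list of (u, v) pixel coordinates of SLAM landmarks.
    sample_z  : metric depths of those landmarks (same order).
    Returns the relative map remapped to metric scale.
    """
    d_rel = np.array([rel_depth[v, u] for u, v in sample_uv], dtype=float)
    # Design matrix [d_rel, 1] for the affine fit d_metric = a*d_rel + b.
    A = np.stack([d_rel, np.ones_like(d_rel)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, np.asarray(sample_z, dtype=float), rcond=None)
    return a * rel_depth + b
```

Note that some relative-depth networks (e.g. the Depth Anything family) output values closer to inverse depth; in that case the same affine fit would be performed in inverse-depth space before inverting back to metric depth.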
| Appears in Collections: | 電信工程學研究所 |

Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-114-1.pdf | 2.52 MB | Adobe PDF |
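The localization results in the abstract are reported as the fraction of frames whose predicted box overlaps the ground truth with IoU above 0.5, scored against the ground truth of the frame on which the label is actually rendered so that latency-induced lag is penalized. A minimal sketch of that metric, with illustrative names not taken from the thesis:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap extents, clamped at zero when the boxes are disjoint.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def success_rate(pred_boxes, gt_boxes, thr=0.5):
    """Fraction of frames with IoU above thr; pred_boxes[i] is the box
    rendered at frame i, gt_boxes[i] the ground truth of that frame."""
    hits = sum(iou(p, g) > thr for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)
```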
