Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101746
Title: Tagging Moving Faces for AR Glasses under Round-Trip Latency and Computing Resource Constraints
Authors: Ren-Hau Shiue (薛仁豪)
Advisor: Homer H. Chen (陳宏銘)
Keyword: Augmented reality; Face tagging; Cloud and edge computing; Siamese network; Depth estimation
Publication Year: 2026
Degree: Master's
Abstract: Augmented reality (AR) glasses with face recognition can overlay virtual name labels on people in view, allowing users to identify others without breaking eye contact. However, reliable label placement on lightweight AR glasses remains challenging due to round-trip latency, limited on-glasses computing resources, and the lack of a depth sensor. Latency causes the rendered label to lag behind the moving target, while depth mismatch can induce double vision when the user fixates on the target. We present a glasses–server collaborative face tagging system that provides real-time on-glasses face localization and metric depth estimation for moving targets without requiring a depth sensor. For face localization, we design a split Siamese pipeline that divides computation between the glasses and the server. The server periodically updates the target face template, while the glasses execute only the lightweight search branch. For depth estimation, we combine visual–inertial SLAM with a monocular depth estimation network. We use sparse metric depth samples from SLAM to affinely align the network's relative depth map to metric scale, producing a metric depth map from which we estimate the target face depth. Evaluated under a latency-aware protocol on three public datasets, our face localization achieves an IoU above 0.5 in 93.25% of frames on average, outperforming on-glasses face detection at 77.38% and server-side face detection at 53.99%. In addition, averaged across four representative social scenarios, our depth estimation keeps the depth error below the diplopia threshold in 96.15% of frames, surpassing the Face-Width Prior baseline at 85.40% and Depth Anything V2 at 78.11%.
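The scale-alignment step described in the abstract — fitting the network's relative depth map to sparse metric depth samples from SLAM — can be sketched as a least-squares affine fit (scale and shift). This is a minimal illustrative sketch under that assumption, not the thesis's actual implementation; the function name and sampling interface are hypothetical.

```python
import numpy as np

def align_to_metric(relative_depth, sample_idx, metric_samples):
    """Fit scale a and shift b so that a * relative_depth + b best matches
    the sparse metric depth samples in the least-squares sense, then apply
    the fitted affine map to the whole relative depth map.

    relative_depth : 2-D array, up-to-affine depth from a monocular network
    sample_idx     : flat indices of pixels with metric depth from SLAM
    metric_samples : metric depths (e.g. meters) at those pixels
    """
    r = relative_depth.ravel()[sample_idx]
    # Design matrix [r, 1] for the affine model a*r + b
    A = np.stack([r, np.ones_like(r)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, metric_samples, rcond=None)
    return a * relative_depth + b
```

In practice such a fit would likely need outlier rejection (e.g. RANSAC over the SLAM samples), since individual sparse depths can be noisy.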
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101746
DOI: 10.6342/NTU202600774
Fulltext Rights: Authorized (open access worldwide)
Embargo Lift Date: 2026-03-05
Appears in Collections: Graduate Institute of Communication Engineering (電信工程學研究所)

Files in This Item:
File: ntu-114-1.pdf (2.52 MB, Adobe PDF)


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
