Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101011

| Title: | Real-Time Stereo Egocentric 3D Human Pose Estimation System with Attention-Guided Features and Domain-Adaptive Learning |
| Author: | Yung-Wei Fan (范詠為) |
| Advisor: | Shao-Yi Chien (簡韶逸) |
| Keywords: | 3D Human Pose Estimation, Egocentric, Multilayer Perceptron, Unsupervised Domain Adaptation, Head-Mounted Display, Augmented Reality, Virtual Reality, ZCU104 |
| Publication Year: | 2025 |
| Degree: | Master's |
| Abstract: | Egocentric 3D human pose estimation enables natural and immersive interaction in virtual reality (VR) and augmented reality (AR) applications, eliminating the need for external cameras or handheld controllers. This approach provides a compact and mobile solution for full-body motion capture, allowing seamless integration with head-mounted displays (HMDs) and AR glasses. However, existing methods face significant challenges, including fisheye lens distortion, severe self-occlusion, and out-of-view body parts, all of which degrade estimation accuracy. Furthermore, many deep-learning-based approaches require high computational resources and complex model architectures, making real-time deployment on edge devices impractical. In this thesis, we propose a real-time stereo egocentric 3D human pose estimation system optimized for edge-device deployment. Our system introduces an Attention-Guided Feature Extractor (AFE) that utilizes multi-scale 2D heatmaps and human-mask attention to enhance feature learning. Additionally, we develop a Stereo Joint Mixer (SJM), a simple MLP-based model that integrates stereo visual features while preserving computational efficiency and accuracy. To improve robustness in real-world environments, we incorporate unsupervised domain adaptation (UDA), including human prior constraints and adversarial domain-classifier training, to reduce the domain gap between synthetic and real-world data. We implement the system on the Xilinx Zynq UltraScale+ MPSoC ZCU104 FPGA, achieving real-time inference at 24–30 FPS in both dataset-inference and camera-streaming modes. This technology paves the way for scalable, real-time egocentric pose estimation, enabling enhanced VR/AR interaction, human-computer interfaces (HCI), and motion-analysis applications. (Hedged, illustrative code sketches of the AFE attention, the SJM, and the adversarial UDA training follow the metadata table below.) |
| URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101011 |
| DOI: | 10.6342/NTU202504485 |
| Full-Text Access: | Authorized (open access worldwide) |
| Electronic Full-Text Release Date: | 2025-11-27 |
| Appears in Collections: | Graduate Institute of Electronics Engineering |
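
The abstract describes the Attention-Guided Feature Extractor (AFE) only at a high level, so the following is a minimal, hedged sketch of how a predicted human mask could gate backbone features before multi-scale 2D heatmap heads. All module names, channel counts, the joint count, and the two-scale heatmap layout are illustrative assumptions, not the thesis's actual architecture.

```python
# Minimal sketch of mask-guided feature attention; shapes and layer choices
# here are assumptions, not the thesis's actual AFE design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskAttention(nn.Module):
    """Gates backbone features with a predicted soft human mask."""
    def __init__(self, channels: int):
        super().__init__()
        # A 1x1 conv predicts a single-channel soft human mask from features.
        self.mask_head = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feat: torch.Tensor):
        mask = torch.sigmoid(self.mask_head(feat))  # (B, 1, H, W)
        # Residual gating: keep original features, emphasize human regions.
        return feat * (1.0 + mask), mask

class AttentionGuidedExtractor(nn.Module):
    """Backbone features refined by mask attention, then 2D heatmaps at two scales."""
    def __init__(self, in_ch: int = 3, feat_ch: int = 64, num_joints: int = 15):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.attn = MaskAttention(feat_ch)
        # Two heatmap heads at different resolutions stand in for "multi-scale".
        self.heat_full = nn.Conv2d(feat_ch, num_joints, 1)
        self.heat_half = nn.Conv2d(feat_ch, num_joints, 1)

    def forward(self, img: torch.Tensor):
        feat = self.stem(img)
        feat, mask = self.attn(feat)
        h1 = self.heat_full(feat)                   # heatmaps at feature scale
        h2 = self.heat_half(F.avg_pool2d(feat, 2))  # coarser heatmaps
        return feat, (h1, h2), mask
```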
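Similarly, a minimal sketch of the Stereo Joint Mixer (SJM) idea: the abstract states only that it is a simple MLP-based model fusing stereo visual features, so the input layout (per-joint descriptors from each view) and the layer sizes below are assumptions.

```python
import torch
import torch.nn as nn

class StereoJointMixer(nn.Module):
    """Sketch of an MLP that fuses left/right per-joint features into 3D joints.

    The real SJM's inputs and widths are assumptions here; the only property
    taken from the abstract is "simple MLP fusing stereo visual features".
    """
    def __init__(self, num_joints: int = 15, feat_dim: int = 32, hidden: int = 256):
        super().__init__()
        in_dim = 2 * num_joints * feat_dim  # concatenated left + right features
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, num_joints * 3),  # (x, y, z) per joint
        )
        self.num_joints = num_joints

    def forward(self, left: torch.Tensor, right: torch.Tensor) -> torch.Tensor:
        # left/right: (B, num_joints, feat_dim) per-joint descriptors per view.
        x = torch.cat([left, right], dim=1).flatten(1)
        return self.mlp(x).view(-1, self.num_joints, 3)

# Usage: pose = StereoJointMixer()(left_feats, right_feats)  # (B, J, 3)
```

A plain MLP over concatenated stereo features keeps the fusion step cheap, which is consistent with the abstract's emphasis on computational efficiency for edge deployment.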
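Finally, adversarial domain-classifier training for UDA is commonly implemented with a gradient reversal layer (DANN-style); the thesis may use a different scheme, and the feature dimension and pooling below are assumptions. The human prior constraints mentioned in the abstract (e.g. bone-length or symmetry penalties) are noted only in a comment.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates (and scales) gradients backward."""
    @staticmethod
    def forward(ctx, x, lam: float):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

class DomainClassifier(nn.Module):
    """Predicts synthetic (0) vs. real (1) from globally pooled features."""
    def __init__(self, feat_dim: int = 64, lam: float = 1.0):
        super().__init__()
        self.lam = lam
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(inplace=True),
            nn.Linear(128, 1),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) -> global average pool -> (B, C)
        pooled = feat.mean(dim=(2, 3))
        return self.net(GradReverse.apply(pooled, self.lam))

# Training step (sketch): pose loss on labeled synthetic data, a BCE domain
# loss on both domains (the reversed gradient pushes the extractor toward
# domain-invariant features), plus human prior terms on unlabeled real data.
```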
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-114-1.pdf | 30.55 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
