Transformer-based Cross-View Human Pose Representation Learning and Its Application to Tai Chi XR Practice

劉倢希; Chieh-Si Liu

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98815

Title:	Transformer-based Cross-View Human Pose Representation Learning and Its Application to Tai Chi XR Practice
Author:	劉倢希 (Chieh-Si Liu)
Advisor:	洪一平 (Yi-Ping Hung)
Keywords:	Pose Embedding, Transformer, Cross-View Pose Retrieval, Video Alignment, Tai Chi, Extended Reality (XR)
Year of publication:	2025
Degree:	Master's
Abstract:	Human pose understanding is a key technology in many motion analysis and interactive applications. Recent studies start from 2D skeletons and learn view-invariant embeddings that carry 3D semantic information. This direction addresses the scale and rotation errors that traditional 3D reconstruction methods produce when camera parameters are unavailable, and it improves the stability and generalizability of motion retrieval, alignment, and recognition in multi-view scenarios. In this work, we integrate an existing Transformer architecture with a cross-view training strategy to build a learning framework that captures both the temporal and the spatial features of human motion. We adopt DSTformer as the backbone model and use its spatial and temporal modeling capabilities to extract, from 2D skeleton sequences, the spatial relationships between joints and the way each joint changes over time. The model is trained with a triplet-based cross-view scheme so that the embedding of a pose remains consistent across viewpoints. To evaluate the method, we conduct downstream experiments on two tasks: cross-view pose sequence retrieval and video alignment. We further apply the proposed model to Tai Chi XR practice, where users can practice without camera viewpoint constraints and, guided by a virtual instructor, receive real-time feedback on whether a motion has been completed. This application demonstrates the practicality and potential of the proposed approach for cross-view human motion analysis.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98815
DOI:	10.6342/NTU202501403
Full-text authorization:	Authorized (access restricted to campus)
Full-text release date:	2030-08-06
Appears in collections:	Graduate Institute of Networking and Multimedia
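
The abstract mentions triplet-based cross-view training and cross-view retrieval but gives no formulas. As a rough illustration only, the sketch below shows one common form of triplet loss and a nearest-neighbor lookup on toy embeddings. The hinge form, the Euclidean distance, the L2 normalization, the margin of 0.2, the function names, and the three-dimensional vectors are all assumptions made for this example; none of them are taken from the thesis.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The margin value is an assumed placeholder, not a value from the thesis.
    """
    a, p, n = (l2_normalize(v) for v in (anchor, positive, negative))
    return max(0.0, euclidean(a, p) - euclidean(a, n) + margin)

def retrieve(query, gallery):
    """Return the index of the gallery embedding closest to the query."""
    q = l2_normalize(query)
    dists = [euclidean(q, l2_normalize(g)) for g in gallery]
    return min(range(len(gallery)), key=dists.__getitem__)

# Toy embeddings, invented for illustration.
anchor = [1.0, 0.0, 0.0]    # pose A seen from view 1
positive = [0.9, 0.1, 0.0]  # the same pose A seen from view 2
negative = [0.0, 1.0, 0.0]  # a different pose B

# Well-separated triplet: the same pose is already closer than the other pose.
loss_good = triplet_loss(anchor, positive, negative)
# Swapped roles: the "positive" is far away, so the loss is large.
loss_bad = triplet_loss(anchor, negative, positive)
# Cross-view retrieval: the view-2 embedding of pose A should win.
best = retrieve(anchor, [negative, positive])
```

In this toy setup the loss vanishes once the same pose seen from another view lies closer to the anchor than a different pose by at least the margin, which is the property a view-invariant embedding is meant to have.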

Files in this item:

File	Size	Format
ntu-113-2.pdf (not authorized for public access)	7.3 MB	Adobe PDF


Unless a copyright license is specifically stated, documents in this system are protected by copyright, with all rights reserved.
