Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98815

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 洪一平 | zh_TW |
| dc.contributor.advisor | Yi-Ping Hung | en |
| dc.contributor.author | 劉倢希 | zh_TW |
| dc.contributor.author | Chieh-Si Liu | en |
| dc.date.accessioned | 2025-08-19T16:18:36Z | - |
| dc.date.available | 2025-08-20 | - |
| dc.date.copyright | 2025-08-19 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-08-06 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98815 | - |
| dc.description.abstract | Human pose understanding is a key technology in many motion analysis and interactive applications. Recent work has begun learning, from 2D skeletons, embedding representations that carry 3D semantics and remain invariant to viewpoint changes, addressing the scale and rotation errors that traditional 3D reconstruction methods incur when camera parameters are unavailable, and improving the stability and generalizability of action retrieval, alignment, and recognition in multi-view settings. We integrate an existing Transformer architecture with a cross-view training strategy to build a learning framework that captures both the temporal and spatial characteristics of human motion. We adopt DSTformer as the backbone, exploiting its spatial and temporal modeling capabilities to extract inter-joint spatial relationships and per-joint temporal dynamics from 2D skeleton sequences. Through a triplet-based cross-view training scheme, the model learns pose embeddings that stay consistent across viewpoints. To validate the method, we run downstream experiments on cross-view pose sequence retrieval and video alignment, and further apply the model to Tai Chi XR practice, letting users train without camera-viewpoint constraints and receive real-time feedback on motion completion under the guidance of a virtual instructor, demonstrating the practicality and application potential of our method for cross-view human motion analysis. | zh_TW |
| dc.description.abstract | Human pose understanding is essential to many motion analysis and interactive applications. Recent studies have explored learning view-invariant representations from 2D skeletons that capture 3D semantic information without relying on camera parameters. This direction aims to overcome the scale and rotation inconsistencies often encountered in traditional 3D reconstruction approaches, thereby improving the stability and generalizability of pose retrieval, alignment, and recognition under multi-view scenarios. In this work, we integrate an existing Transformer-based architecture with a cross-view training strategy to construct a learning framework that captures both temporal and spatial features of human motion. We adopt DSTformer as the backbone model for its spatio-temporal modeling capability, extracting spatial relationships between joints and the temporal dynamics of joints from 2D pose sequences. The model is trained with a triplet-based cross-view learning scheme to produce embeddings that are consistent across views and semantically meaningful (a minimal illustrative sketch of such an objective appears after the metadata table below). To evaluate the effectiveness of our method, we conduct experiments on two downstream tasks: cross-view pose sequence retrieval and video alignment. We further deploy the proposed model in Tai Chi XR practice, where users can practice under unconstrained camera viewpoints and receive real-time feedback on motion completion while following a virtual instructor. These results confirm the applicability of the proposed approach in realistic cross-view motion scenarios. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-08-19T16:18:36Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-08-19T16:18:36Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Acknowledgements; 摘要; Abstract; Contents; List of Figures; List of Tables; Chapter 1 Introduction; Chapter 2 Related Work (2.1 Metric Learning; 2.2 Human Pose Representation Learning; 2.3 Temporal Pose Modeling); Chapter 3 Method (3.1 Transformer-based Pose Embedding; 3.2 Loss Functions: 3.2.1 Triplet Selection, 3.2.2 Triplet Ratio Loss, 3.2.3 Positive Pairwise Loss, 3.2.4 Overall Loss; 3.3 Pose Normalization and Camera Augmentation); Chapter 4 Experiments (4.1 Datasets; 4.2 Implementation Details; 4.3 Cross-View Pose Sequence Retrieval: 4.3.1 Evaluation Procedure, 4.3.2 Evaluation Metric, 4.3.3 Quantitative Results, 4.3.4 Qualitative Results; 4.4 Video Alignment: 4.4.1 Evaluation Procedure, 4.4.2 Evaluation Metric, 4.4.3 Quantitative Results, 4.4.4 Qualitative Results); Chapter 5 Application to Tai Chi XR Practice (5.1 System Overview; 5.2 Model Adaptation for Tai Chi; 5.3 Interface Implementation: 5.3.1 Virtual Instructor, 5.3.2 User Interface, 5.3.3 Real-Time Processing Pipeline); Chapter 6 Conclusion and Future Work (6.1 Conclusion; 6.2 Future Work); References | - |
| dc.language.iso | en | - |
| dc.subject | Pose Embedding | zh_TW |
| dc.subject | Transformer Model | zh_TW |
| dc.subject | Cross-View Pose Retrieval | zh_TW |
| dc.subject | Video Alignment | zh_TW |
| dc.subject | Tai Chi | zh_TW |
| dc.subject | Extended Reality (XR) | zh_TW |
| dc.subject | Cross-View Pose Retrieval | en |
| dc.subject | Pose Embedding | en |
| dc.subject | Extended Reality (XR) | en |
| dc.subject | Tai Chi | en |
| dc.subject | Video Alignment | en |
| dc.subject | Transformer | en |
| dc.title | Transformer-based Cross-View Human Pose Representation Learning and Its Application to Tai Chi XR Practice | zh_TW |
| dc.title | Transformer-based Cross-View Human Pose Representation Learning and Its Application to Tai Chi XR Practice | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-2 | - |
| dc.description.degree | Master | - |
| dc.contributor.oralexamcommittee | 陳錫中;陳文翔;陳冠文;陳永祥 | zh_TW |
| dc.contributor.oralexamcommittee | Hsi-Chung Chen;Wen-Shiang Chen;Kuan-Wen Chen;Yong-Xiang Chen | en |
| dc.subject.keyword | Pose Embedding, Transformer Model, Cross-View Pose Retrieval, Video Alignment, Tai Chi, Extended Reality (XR) | zh_TW |
| dc.subject.keyword | Pose Embedding, Transformer, Cross-View Pose Retrieval, Video Alignment, Tai Chi, Extended Reality (XR) | en |
| dc.relation.page | 37 | - |
| dc.identifier.doi | 10.6342/NTU202501403 | - |
| dc.rights.note | Authorized for release (campus access only) | - |
| dc.date.accepted | 2025-08-12 | - |
| dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
| dc.contributor.author-dept | Graduate Institute of Networking and Multimedia | - |
| dc.date.embargo-lift | 2030-08-06 | - |
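
The abstracts above describe training a DSTformer backbone with a triplet-based cross-view scheme, where embeddings of the same pose sequence seen from different cameras are pulled together. Below is a minimal PyTorch sketch of such an objective; it is not the thesis code. `PoseEncoder` is a hypothetical stand-in for DSTformer, and the batch-shifted negative selection and margin value are illustrative assumptions; the thesis's actual triplet selection, triplet ratio loss, and positive pairwise loss (Sections 3.2.1-3.2.4 of the table of contents) may differ.

```python
# Hedged sketch of a cross-view triplet objective for pose embeddings.
# All names here (PoseEncoder, cross_view_triplet_loss) are illustrative
# assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseEncoder(nn.Module):
    """Stand-in spatio-temporal encoder: 2D joint sequences -> embedding."""
    def __init__(self, num_joints=17, dim=128):
        super().__init__()
        self.proj = nn.Linear(num_joints * 2, dim)  # flatten joints per frame
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (B, T, J, 2)
        b, t, j, c = x.shape
        h = self.proj(x.reshape(b, t, j * c))   # one token per frame
        h = self.encoder(h).mean(dim=1)         # temporal average pooling
        return F.normalize(self.head(h), dim=-1)  # unit-norm embedding

def cross_view_triplet_loss(f, seq_view_a, seq_view_b, margin=0.3):
    """Anchor and positive are the SAME pose sequence seen from two cameras;
    negatives are other sequences in the batch (here: a batch roll)."""
    za = f(seq_view_a)                          # (B, D) anchors
    zb = f(seq_view_b)                          # (B, D) positives
    pos = (za - zb).pow(2).sum(-1)              # matched cross-view pairs
    neg = (za - zb.roll(shifts=1, dims=0)).pow(2).sum(-1)  # mismatched pairs
    return F.relu(pos - neg + margin).mean()

# Toy usage: 8 clips, 16 frames, 17 joints, observed from two viewpoints.
enc = PoseEncoder()
view_a = torch.randn(8, 16, 17, 2)
view_b = torch.randn(8, 16, 17, 2)
loss = cross_view_triplet_loss(enc, view_a, view_b)
loss.backward()
print(float(loss))
```

Minimizing this loss pulls cross-view embeddings of the same clip together while pushing different clips apart; the resulting embeddings can then be compared by Euclidean distance for the downstream cross-view retrieval and frame-wise video alignment tasks the abstract mentions.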
Appears in Collections: Graduate Institute of Networking and Multimedia
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-113-2.pdf (restricted: not publicly accessible) | 7.3 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.