Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98695

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 許永真 | zh_TW |
| dc.contributor.advisor | Jane Yung-Jen Hsu | en |
| dc.contributor.author | 魏子翔 | zh_TW |
| dc.contributor.author | Zi-Xiang Wei | en |
| dc.date.accessioned | 2025-08-18T16:08:03Z | - |
| dc.date.available | 2025-08-19 | - |
| dc.date.copyright | 2025-08-18 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-08-05 | - |
| dc.identifier.citation | Zhengyou Zhang. Microsoft kinect sensor and its effect. IEEE Multimedia, 19(2):4–10, 2012. Yuhui Yuan, Rao Fu, Lang Huang, Weihong Lin, Chao Zhang, Xilin Chen, and Jingdong Wang. HRFormer: High-resolution vision transformer for dense prediction. Advances in Neural Information Processing Systems, 34:7281–7293, 2021. Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. Pranay Gupta, Divyanshu Sharma, and Ravi Kiran Sarvadevabhatla. Syntactically guided generative embeddings for zero-shot skeleton action recognition. In Proceedings of IEEE International Conference on Image Processing (ICIP), pages 439–443. IEEE, 2021. Sheng-Wei Li, Zi-Xiang Wei, Wei-Jie Chen, Yi-Hsin Yu, Chih-Yuan Yang, and Jane Yung-Jen Hsu. SA-DVAE: Improving zero-shot skeleton-based action recognition by disentangled variational autoencoders. In Proceedings of European Conference on Computer Vision (ECCV), 2024. J. W. Davis and A. F. Bobick. The representation and recognition of human movement using temporal templates. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1997. Aaron F. Bobick and James W. Davis. The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3):257–267, 2001. Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of IEEE/CVF International Conference on Computer Vision, pages 4489–4497, 2015. Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014. Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5693–5703, 2019. Qiuhong Ke, Mohammed Bennamoun, Senjian An, Ferdous Sohel, and Farid Boussaid. A new representation of skeleton sequences for 3d action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3288–3297, 2017. Ke Cheng, Yifan Zhang, Xiangyu He, Weihan Chen, Jian Cheng, and Hanqing Lu. Skeleton-based action recognition with shift graph convolutional network. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 183–192, 2020. Yao-Hung Hubert Tsai, Liang-Kang Huang, and Ruslan Salakhutdinov. Learning robust visual-semantic embeddings. In Proceedings of the IEEE International Conference on Computer Vision, pages 3571–3580, 2017. Michael Wray, Diane Larlus, Gabriela Csurka, and Dima Damen. Fine-grained action retrieval through multiple parts-of-speech embeddings. In Proceedings of IEEE/CVF International Conference on Computer Vision, 2019. Edgar Schonfeld, Sayna Ebrahimi, Samarth Sinha, Trevor Darrell, and Zeynep Akata. Generalized zero- and few-shot learning via aligned variational autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8247–8255, 2019. Ming-Zhe Li, Zhen Jia, Zhang Zhang, Zhanyu Ma, and Liang Wang. Multi-semantic fusion model for generalized zero-shot skeleton-based action recognition. In Proceedings of International Conference on Image and Graphics, pages 68–80. Springer, 2023. Yujie Zhou, Wenwen Qiang, Anyi Rao, Ning Lin, Bing Su, and Jiaqi Wang. Zero-shot skeleton-based action recognition via mutual information estimation and maximization. In Proceedings of the 31st ACM International Conference on Multimedia, pages 5302–5310, 2023. OpenAI. ChatGPT: Mar 23 version. https://chat.openai.com, 2023. Anqi Zhu, Qiuhong Ke, Mingming Gong, and James Bailey. Part-aware unified representation of language and skeleton for zero-shot action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18761–18770, 2024. Yang Chen, Jingcai Guo, Tian He, Xiaocheng Lu, and Ling Wang. Fine-grained side information guided dual-prompts for zero-shot skeleton action recognition. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 778–786, 2024. Jeonghyeok Do and Munchurl Kim. TDSM: Triplet diffusion for skeleton-text matching in zero-shot action recognition. In Proceedings of IEEE/CVF International Conference on Computer Vision, 2025. Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding, 2019. Wei-Lun Chao, Soravit Changpinyo, Boqing Gong, and Fei Sha. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In Proceedings of European Conference on Computer Vision (ECCV), 2016. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In Proceedings of International Conference on Machine Learning, pages 8748–8763. PMLR, 2021. Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. NTU RGB+D: A large scale dataset for 3d human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1010–1019, 2016. Jun Liu, Amir Shahroudy, Mauricio Perez, Gang Wang, Ling-Yu Duan, and Alex C. Kot. NTU RGB+D 120: A large-scale benchmark for 3d human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10):2684–2701, 2020. Chunhui Liu, Yueyu Hu, Yanghao Li, Sijie Song, and Jiaying Liu. PKU-MMD: A large scale benchmark for continuous multi-modal human action understanding. arXiv preprint arXiv:1703.07475, 2017. Neel Trivedi, Anirudh Thatipelli, and Ravi Kiran Sarvadevabhatla. NTU-X: An enhanced large-scale dataset for improving pose-based recognition of subtle human actions, 2021. Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3d hands, face, and body from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019. Vasileios Choutas, Georgios Pavlakos, Timo Bolkart, Dimitrios Tzionas, and Michael J. Black. Monocular expressive body regression through body-driven attention. In Proceedings of European Conference on Computer Vision (ECCV), 2020. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98695 | - |
| dc.description.abstract | 現有的零樣本骨架動作辨識方法大多依賴固定的類別標籤或通用的文字描述,導致骨架動作與語意理解之間的對齊效果受限。為了解決此問題,我們提出 Vision-augmented Skeleton-Text Alignment(ViSTA)架構,一種基於雙重變分自編碼器(Dual-VAE)的框架,藉由具備視覺理解能力的大型語言模型,從動畫化的骨架序列中生成以動作為核心的描述。這些視覺輔助的描述與原始類別標籤進行語意融合,並透過預訓練文字編碼器轉換為豐富的語意表示。ViSTA 採用雙重 VAE 結構解耦語意與非語意資訊,並結合跨模態重建與動量對比學習以強化模態對齊效果。與以 Dual-VAE 為基礎的原始方法相比,ViSTA 在 ZSL 設定下於 NTU-60、NTU-120 和 PKU-MMD 分別提升 +5.8%、+7.37% 和 +4.65% 的準確率,在 GZSL 設定下亦於三個資料集分別提升 +1.8%、+2.93%、與 +1.44% 的調和平均(harmonic mean)表現。 | zh_TW |
| dc.description.abstract | Existing approaches to zero-shot skeleton-based action recognition often rely on fixed class labels or generic textual descriptions, which limits the alignment between skeletal motion and semantic understanding. To address this, we propose Vision-augmented Skeleton-Text Alignment (ViSTA), a dual-VAE framework that leverages a vision-language model to generate motion-centric descriptions from animated skeleton sequences. These vision-informed descriptions are fused with class labels and embedded via a pre-trained text encoder to form rich semantic representations. We disentangle semantic and irrelevant factors using dual VAEs and align the modalities through cross-reconstruction and momentum-based contrastive learning. Compared to a strong dual-VAE baseline, ViSTA improves ZSL accuracy by +5.8% on NTU-60, +7.37% on NTU-120, and +4.65% on PKU-MMD, and achieves gains of +1.8%, +2.93%, and +1.44% in GZSL harmonic mean on NTU-60, NTU-120, and PKU-MMD, respectively. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-08-18T16:08:03Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-08-18T16:08:03Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Acknowledgements i 摘要 ii Abstract iii Contents iv List of Figures vii List of Tables viii Chapter 1 Introduction 1 1.1 Background 1 1.2 Motivation 2 1.3 Proposed Method 3 1.4 Thesis Organization 3 Chapter 2 Related Work 5 2.1 Early Foundations in Action Recognition 5 2.2 Skeleton-Based Action Recognition 6 2.3 Cross-Modal Embedding Foundations 7 2.4 Latent Alignment with Variational Autoencoders 7 2.5 Semantic Enrichment with Language Models 8 2.6 Momentum-Based Contrastive Learning 9 Chapter 3 Problem Definition 10 3.1 Skeleton-Based Action Recognition 10 3.2 Zero-Shot Skeleton-Based Action Recognition 11 3.3 Generalized Zero-Shot Skeleton-Based Action Recognition 12 Chapter 4 Methodology 13 4.1 Skeleton-to-GIF Visualization & Motion-Centric Captioning 13 4.1.1 Skeleton-to-GIF Rendering 14 4.1.2 Vision-LLM Caption Generation 14 4.1.3 Caption Verification and Embedding 15 4.1.4 Discussion and Rationale 15 4.2 Dual-VAE Cross-Modal Alignment with Contrastive Regularization 16 4.2.1 Feature Extraction and Dual-VAE Latent Representation 17 4.2.2 Memory Bank-based Contrastive Learning 19 4.2.3 Combined Objective 19 4.3 Zero-Shot Classification (ZSL) 20 4.4 Generalized Zero-Shot Classification (GZSL) 21 Chapter 5 Experiments 24 5.1 Datasets and Evaluation Protocol 24 5.1.1 Datasets 24 5.1.2 Evaluation Metrics 25 5.1.3 Feature Extraction 25 5.1.4 Performance Comparison to State-of-the-Art Models 26 5.2 Ablation Studies 28 5.3 Discussions 29 5.3.1 Semantic Embedding Visualization 29 5.3.2 Performance under High Unseen-Class Diversity and Semantic Fusion 31 5.4 Additional Analysis on Description Quality 34 5.4.1 Description Revision and Its Effect on Generalization 34 5.4.2 Comparison with Gemini-Generated Descriptions 36 5.5 Potential with Pose-Estimated Skeleton Data 38 Chapter 6 Conclusion 40 6.1 Contribution 40 6.2 Limitation and Future Work 41 References 43 Appendix A — Vision-Language Prompt Design for Description Generation 48 A.1 Full Prompt for GPT-4o-Based Skeleton Captioning 48 | - |
| dc.language.iso | en | - |
| dc.subject | 零樣本學習 | zh_TW |
| dc.subject | 基於骨架之動作辨識 | zh_TW |
| dc.subject | 視覺語言模型 | zh_TW |
| dc.subject | 對比學習 | zh_TW |
| dc.subject | 多模態對齊 | zh_TW |
| dc.subject | Multimodal Alignment | en |
| dc.subject | Zero-Shot Learning | en |
| dc.subject | Skeleton-Based Action Recognition | en |
| dc.subject | Vision-Language Model | en |
| dc.subject | Contrastive Learning | en |
| dc.title | 基於視覺語言模型與記憶對比學習之零樣本骨架動作識別方法 | zh_TW |
| dc.title | Vision-Augmented Skeleton-Text Alignment for Zero-Shot Action Recognition with Memory-Based Contrastive Learning | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-2 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.coadvisor | 鄭文皇 | zh_TW |
| dc.contributor.coadvisor | Wen-Huang Cheng | en |
| dc.contributor.oralexamcommittee | 吳家麟;楊智淵;陳駿丞 | zh_TW |
| dc.contributor.oralexamcommittee | Ja-Ling Wu;Chih-Yuan Yang;Jun-Cheng Chen | en |
| dc.subject.keyword | 零樣本學習,基於骨架之動作辨識,視覺語言模型,對比學習,多模態對齊 | zh_TW |
| dc.subject.keyword | Zero-Shot Learning,Skeleton-Based Action Recognition,Vision-Language Model,Contrastive Learning,Multimodal Alignment | en |
| dc.relation.page | 49 | - |
| dc.identifier.doi | 10.6342/NTU202503431 | - |
| dc.rights.note | 同意授權(限校園內公開) | - |
| dc.date.accepted | 2025-08-11 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 資訊工程學系 | - |
| dc.date.embargo-lift | 2025-08-19 | - |
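The abstract above names momentum-based contrastive learning over a memory bank as one component of ViSTA's skeleton-text alignment. As an illustrative NumPy sketch only (not the thesis code; the function names, `m`, and `tau` defaults are assumptions borrowed from the MoCo-style setup the abstract cites), the two core pieces are an exponential-moving-average update of the momentum encoder and an InfoNCE loss that contrasts a query against one positive and a bank of negatives:

```python
import numpy as np

def momentum_update(q_params, k_params, m=0.999):
    # EMA update: the key (momentum) encoder slowly tracks the query encoder.
    return {name: m * k_params[name] + (1.0 - m) * q_params[name]
            for name in k_params}

def info_nce(query, positive, memory_bank, tau=0.07):
    # InfoNCE: pull the query toward its positive pair, push it away from
    # the negatives stored in the memory bank (cosine similarity / temperature).
    q = query / np.linalg.norm(query)
    k = positive / np.linalg.norm(positive)
    bank = memory_bank / np.linalg.norm(memory_bank, axis=1, keepdims=True)
    logits = np.concatenate(([q @ k], bank @ q)) / tau  # positive logit first
    logits -= logits.max()                              # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                            # cross-entropy, label 0
```

The loss shrinks as the query aligns with its positive relative to the bank. The GZSL harmonic mean reported in the abstract is the standard metric H = 2·Acc_seen·Acc_unseen / (Acc_seen + Acc_unseen), which rewards balanced accuracy on seen and unseen classes.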
| Appears in Collections: | 資訊工程學系 | |
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-113-2.pdf (access restricted to NTU campus IPs; off-campus users should use the NTU VPN service) | 1.33 MB | Adobe PDF | |
Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
