NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/95986
Full metadata record (DC field: value [language]):
dc.contributor.advisor: 許永真 [zh_TW]
dc.contributor.advisor: Jane Yung-jen Hsu [en]
dc.contributor.author: 李勝維 [zh_TW]
dc.contributor.author: Sheng-Wei Li [en]
dc.date.accessioned: 2024-09-25T16:28:46Z
dc.date.available: 2024-09-26
dc.date.copyright: 2024-09-25
dc.date.issued: 2024
dc.date.submitted: 2024-09-01
dc.identifier.citationJ. K. Aggarwal and M. S. Ryoo, “Human activity analysis: A review,” Acm Computing Surveys (Csur), vol. 43, no. 3, pp. 1–43, 2011.
M. A. Rahim, J. Shin, and M. R. Islam, “Human-machine interaction based on hand gesture recognition using skeleton information of kinect sensor,” in Proceedings of the 3rd international conference on applications in information technology, 2018, pp. 75–79.
G. T. Papadopoulos, A. Axenopoulos, and P. Daras, “Real-time skeleton-tracking-based human action recognition using kinect data,” in MultiMedia Modeling: 20th Anniversary International Conference, MMM 2014, Dublin, Ireland, January 6-10, 2014, Proceedings, Part I 20. Springer, 2014, pp. 473–483.
K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution representation learning for human pose estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5693–5703.
Y. Yuan, R. Fu, L. Huang, W. Lin, C. Zhang, X. Chen, and J. Wang, “Hrformer: High-resolution vision transformer for dense predict,” Advances in Neural Information Processing Systems, vol. 34, pp. 7281–7293, 2021.
Z. Zhang, “Microsoft kinect sensor and its effect,” IEEE multimedia, vol. 19, no. 2, pp. 4–10, 2012.
L. Keselman, J. Iselin Woodfill, A. Grunnet-Jepsen, and A. Bhowmik, “Intel realsense stereoscopic depth cameras,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2017, pp. 1–10.
J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308.
H. Duan, Y. Zhao, K. Chen, D. Lin, and B. Dai, “Revisiting skeleton-based action recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2969–2978.
S. Yan, Y. Xiong, and D. Lin, “Spatial temporal graph convolutional networks for skeleton-based action recognition,” in Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018.
C. Plizzari, M. Cannici, and M. Matteucci, “Skeleton-based action recognition via spatial and temporal transformer networks,” Computer Vision and Image Understanding, vol. 208, p. 103219, 2021.
A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, “Ntu rgb+ d: A large scale dataset for 3d human activity analysis,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1010–1019.
J. Liu, A. Shahroudy, M. Perez, G. Wang, L.-Y. Duan, and A. C. Kot, “Ntu rgb+ d 120: A large-scale benchmark for 3d human activity understanding,” IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 10, pp. 2684–2701, 2019.
K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.
D. S. Wishart, A. Guo, E. Oler, F. Wang, A. Anjum, H. Peters, R. Dizon, Z. Sayeeda, S. Tian, B. L. Lee et al., “Hmdb 5.0: the human metabolome database for 2022,” Nucleic acids research, vol. 50, no. D1, pp. D622–D631, 2022.
I. Y. Jung, “A review of privacy-preserving human and human activity recognition,” International Journal on Smart Sensing and Intelligent Systems, vol. 13, no. 1, pp. 1–13, 2020.
P. Gupta, D. Sharma, and R. K. Sarvadevabhatla, “Syntactically guided generative embeddings for zero-shot skeleton action recognition,” in 2021 IEEE International Conference on Image Processing (ICIP). IEEE, 2021, pp. 439–443.
Y. Zhou, W. Qiang, A. Rao, N. Lin, B. Su, and J. Wang, “Zero-shot skeleton-based action recognition via mutual information estimation and maximization,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 5302–5310.
F. Pourpanah, M. Abdar, Y. Luo, X. Zhou, R. Wang, C. P. Lim, X.-Z. Wang, and Q. J. Wu, “A review of generalized zero-shot learning methods,” IEEE transactions on pattern analysis and machine intelligence, 2022.
K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” Advances in neural information processing systems, vol. 27, 2014.
B. D. Lucas and T. Kanade, “An iterative image registration technique with an application to stereo vision,” in IJCAI’81: 7th international joint conference on Artificial intelligence, vol. 2, 1981, pp. 674–679.
E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, “Flownet 2.0: Evolution of optical flow estimation with deep networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2462–2470.
L. Sevilla-Lara, Y. Liao, F. Güney, V. Jampani, A. Geiger, and M. J. Black, “On the integration of optical flow and action recognition,” in Pattern Recognition: 40th German Conference, GCPR 2018, Stuttgart, Germany, October 9-12, 2018, Proceedings 40. Springer, 2019, pp. 281–297.
Q. Ke, M. Bennamoun, S. An, F. Sohel, and F. Boussaid, “A new representation of skeleton sequences for 3d action recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 3288–3297.
L. Shi, Y. Zhang, J. Cheng, and H. Lu, “Two-stream adaptive graph convolutional networks for skeleton-based action recognition,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 12 026–12 035.
P. Zhang, C. Lan, W. Zeng, J. Xing, J. Xue, and N. Zheng, “Semantics-guided neural networks for efficient skeleton-based human action recognition,” in proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 1112–1121.
K. Cheng, Y. Zhang, X. He, W. Chen, J. Cheng, and H. Lu, “Skeleton-based action recognition with shift graph convolutional network,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 183–192.
Z. Han, Z. Fu, and J. Yang, “Learning the redundancy-free features for generalized zero-shot object recognition,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 12 865–12 874.
E. Kodirov, T. Xiang, Z. Fu, and S. Gong, “Unsupervised domain adaptation for zero-shot learning,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 2452–2460.
B. Brattoli, J. Tighe, F. Zhdanov, P. Perona, and K. Chalupka, “Rethinking zero-shot video classification: End-to-end training for realistic applications,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4613–4623.
X. Xu, T. M. Hospedales, and S. Gong, “Multi-task zero-shot action recognition with prioritised data augmentation,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14. Springer, 2016, pp. 343–359.
Y.-H. Hubert Tsai, L.-K. Huang, and R. Salakhutdinov, “Learning robust visual-semantic embeddings,” in Proceedings of the IEEE International conference on Computer Vision, 2017, pp. 3571–3580.
E. Schonfeld, S. Ebrahimi, S. Sinha, T. Darrell, and Z. Akata, “Generalizedzero-and few-shot learning via aligned variational autoencoders,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 8247–8255.
Y. Atzmon and G. Chechik, “Adaptive confidence smoothing for generalized zero-shot learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11 671–11 680.
Y. Bengio, “Deep learning of representations: Looking forward,” in International conference on statistical language and speech processing. Springer, 2013, pp. 1–37.
Z. Chen, Y. Luo, R. Qiu, S. Wang, Z. Huang, J. Li, and Z. Zhang, “Semantics disentangling for generalized zero-shot learning,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 8712–8720.
Y. Gao, C. Tang, and J. Lv, “Cluster-based contrastive disentangling for generalized zero-shot learning,” arXiv preprint arXiv:2203.02648, 2022.
D. Shao, Y. Zhao, B. Dai, and D. Lin, “Finegym: A hierarchical video dataset for fine-grained action understanding,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2616–2625.
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763.
M. Wray, D. Larlus, G. Csurka, and D. Damen, “Fine-grained action retrieval through multiple parts-of-speech embeddings,” in IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
M.-Z. Li, Z. Jia, Z. Zhang, Z. Ma, and L. Wang, “Multi-semantic fusion model for generalized zero-shot skeleton-based action recognition,” in International Conference on Image and Graphics. Springer, 2023, pp. 68–80.
L. Batina, B. Gierlichs, E. Prouff, M. Rivain, F.-X. Standaert, and N. Veyrat-Charvillon, “Mutual information analysis: a comprehensive study,” Journal of Cryptology, vol. 24, no. 2, pp. 269–291, 2011.
-
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/95986
dc.description.abstract: 在廣義零樣本基於骨架的動作識別中,現有方法通過特定模態的投影網絡學習骨架特徵和語義嵌入的共享潛在空間。然而,動作識別數據集中,骨架序列因樣本可變而類別標籤為恆定的非對稱性帶來了學習共享潛在空間時的重大挑戰。為了解決這一問題,我們引入了SMARTEN,一種基於對抗學習的特徵解耦方法,從骨架特徵中分離語義相關和無關的潛在變量,以更好地與語義嵌入對齊。利用特定模態的變分自編碼器(VAE)結合交叉重構損失,SMARTEN將語義相關的骨架特徵與語義嵌入對齊。我們的方法在零樣本和廣義零樣本動作識別中設立了新基準,在NTU RGB+D 60、NTU RGB+D 120和FineGym 99等數據集上顯示出顯著的改進。 [zh_TW]
dc.description.abstract: In generalized zero-shot skeleton-based action recognition, existing approaches learn a shared latent space of skeleton features and semantic embeddings via modality-specific projection networks. However, action recognition datasets are asymmetric: skeleton sequences vary from sample to sample while class labels stay constant, which makes learning such a shared space difficult. To address this, we introduce SMARTEN, an adversarial feature-disentanglement method that separates semantic-related from semantic-unrelated latent variables in skeleton features for better alignment with semantic embeddings. Using modality-specific variational autoencoders (VAEs) coupled with a cross-reconstruction loss, SMARTEN aligns the semantic-related skeleton features with the semantic embeddings. Our approach sets a new state of the art in zero-shot and generalized zero-shot action recognition, with significant improvements on benchmark datasets such as NTU RGB+D 60, NTU RGB+D 120, and FineGym 99. [en]
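The cross-modal alignment described in the abstract lends itself to a small illustration. Below is a minimal PyTorch sketch of two modality-specific VAEs trained with within-modality and cross-modality reconstruction losses, in the spirit of the aligned-VAE line of work cited in the references (Schonfeld et al.). All module names, feature dimensions, and loss weights here are hypothetical, this is not the thesis' released code, and the adversarial total-correlation penalty listed in the table of contents is omitted.

```python
# A minimal sketch, assuming 256-d skeleton features, 768-d text
# embeddings, and 64-d latents; not the thesis implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAEncoder(nn.Module):
    """Encodes an input feature vector into a diagonal-Gaussian latent."""
    def __init__(self, in_dim: int, latent_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)

    def forward(self, x):
        h = self.net(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return z, mu, logvar

def kld(mu, logvar):
    """KL divergence from N(mu, sigma^2) to N(0, I), averaged over the batch."""
    return (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)).mean()

# Skeleton branch: two encoders disentangle semantic-related (z_r) and
# semantic-unrelated (z_u) latents; the text branch has a single latent
# that shares the dimensionality of z_r so the two can be swapped.
skel_dim, text_dim, zr_dim, zu_dim = 256, 768, 64, 64  # assumed sizes
enc_skel_r = VAEncoder(skel_dim, zr_dim)
enc_skel_u = VAEncoder(skel_dim, zu_dim)
enc_text = VAEncoder(text_dim, zr_dim)
dec_skel = nn.Sequential(nn.Linear(zr_dim + zu_dim, 512), nn.ReLU(),
                         nn.Linear(512, skel_dim))
dec_text = nn.Sequential(nn.Linear(zr_dim, 512), nn.ReLU(),
                         nn.Linear(512, text_dim))

def loss_step(f_skel, f_text):
    z_r, mu_r, lv_r = enc_skel_r(f_skel)
    z_u, mu_u, lv_u = enc_skel_u(f_skel)
    z_t, mu_t, lv_t = enc_text(f_text)
    # Within-modality reconstruction
    rec = (F.mse_loss(dec_skel(torch.cat([z_r, z_u], dim=1)), f_skel)
           + F.mse_loss(dec_text(z_t), f_text))
    # Cross-reconstruction: swap the semantic-related latent across
    # modalities, pushing z_r (skeleton) and z_t (text) into a shared space.
    cross = (F.mse_loss(dec_skel(torch.cat([z_t, z_u], dim=1)), f_skel)
             + F.mse_loss(dec_text(z_r), f_text))
    kl = kld(mu_r, lv_r) + kld(mu_u, lv_u) + kld(mu_t, lv_t)
    return rec + cross + 0.1 * kl  # 0.1 is an arbitrary KL weight

# One loss evaluation on random stand-in features (batch of 8)
loss = loss_step(torch.randn(8, skel_dim), torch.randn(8, text_dim))
loss.backward()
```

Under such a setup, zero-shot classification could proceed by encoding each unseen class label with enc_text and assigning a test skeleton to the class whose latent is nearest to its z_r; the thesis' exact classifier may differ.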
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-09-25T16:28:46Z. No. of bitstreams: 0 [en]
dc.description.provenance: Made available in DSpace on 2024-09-25T16:28:46Z (GMT). No. of bitstreams: 0 [en]
dc.description.tableofcontents:
口試委員審定書 (Oral Examination Committee Certification) i
Acknowledgments ii
摘要 (Chinese Abstract) iii
Abstract iv
List of Figures viii
List of Tables ix
1 Introduction 1
1.1 Background 1
1.2 Motivation 2
1.3 Proposed Method 3
1.4 Thesis Organization 4
2 Related Work 5
2.1 Action Recognition 5
2.1.1 RGB Videos 5
2.1.2 Optical Flows 6
2.1.3 Human Skeleton Representation 6
2.2 Zero-Shot Action Recognition 7
2.3 Generalized Zero-Shot Action Recognition 7
2.4 Feature Disentanglement in Generalized Zero-Shot Learning 8
3 Problem Definition 10
3.1 Zero-Shot Skeleton-Based Action Recognition 11
3.2 Generalized Zero-Shot Skeleton-Based Action Recognition 11
4 Methodology 12
4.1 Feature Extraction 12
4.2 Generative Cross-Modal Alignment and Disentanglement Module 14
4.2.1 Latent Representation 14
4.2.2 Feature Disentanglement and VAE Architecture 15
4.2.3 Adversarial Total Correlation Penalty 16
4.2.4 Cross-Alignment 16
4.3 Zero-Shot Classification 17
4.4 Generalized Zero-Shot Classification 18
5 Experiments 19
5.1 Evaluation Protocols 19
5.1.1 Datasets 19
5.1.2 Skeleton and Text Feature Extractors 20
5.1.3 Evaluation Metrics 20
5.2 Comparative Evaluation with State-of-the-Art Models 21
5.2.1 Zero-Shot Learning Results 22
5.2.2 Generalized Zero-Shot Learning Results 22
5.3 Assessment of Model with Rich Textual Descriptions 23
5.3.1 Zero-Shot Learning Analysis 24
5.3.2 Generalized Zero-Shot Learning Analysis 25
5.4 Analysis of Robustness Across Diverse Skeleton Feature Extractors 25
5.5 Robustness Evaluation on Datasets with Non-Standard Class Labels 27
5.6 Ablation Study 28
6 Conclusion 30
6.1 Contribution 30
6.2 Limitation and Future Work 31
References 32
dc.language.iso: en
dc.title: 語義對齊與特徵解離於廣義零樣本動作識別 [zh_TW]
dc.title: SMARTEN: Semantic Alignment Through Feature Disentanglement For Generalized Zero-Shot Skeleton-Based Action Recognition [en]
dc.type: Thesis
dc.date.schoolyear: 113-1
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 楊智淵;王鈺強;陳駿丞 [zh_TW]
dc.contributor.oralexamcommittee: Chih-Yuan Yang;Yu-Chiang Frank Wang;Jun-Cheng Chen [en]
dc.subject.keyword: 零樣本學習, 語義對齊, 特徵解耦, 基於骨架之動作識別 [zh_TW]
dc.subject.keyword: Zero-Shot Learning, Semantic Alignment, Feature Disentanglement, Skeleton-based Action Recognition [en]
dc.relation.page: 38
dc.identifier.doi: 10.6342/NTU202401280
dc.rights.note: 同意授權(全球公開) (Authorization granted; worldwide open access)
dc.date.accepted: 2024-09-03
dc.contributor.author-college: 電機資訊學院
dc.contributor.author-dept: 資訊網路與多媒體研究所
Appears in collections: 資訊網路與多媒體研究所

Files in this item:
File | Size | Format
ntu-113-1.pdf | 3.08 MB | Adobe PDF


Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
