NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/95986
Full metadata record (DC field: value [language]):
dc.contributor.advisor: 許永真 [zh_TW]
dc.contributor.advisor: Jane Yung-jen Hsu [en]
dc.contributor.author: 李勝維 [zh_TW]
dc.contributor.author: Sheng-Wei Li [en]
dc.date.accessioned: 2024-09-25T16:28:46Z
dc.date.available: 2024-09-26
dc.date.copyright: 2024-09-25
dc.date.issued: 2024
dc.date.submitted: 2024-09-01
dc.identifier.citationJ. K. Aggarwal and M. S. Ryoo, “Human activity analysis: A review,” Acm Computing Surveys (Csur), vol. 43, no. 3, pp. 1–43, 2011.
M. A. Rahim, J. Shin, and M. R. Islam, “Human-machine interaction based on hand gesture recognition using skeleton information of kinect sensor,” in Proceedings of the 3rd international conference on applications in information technology, 2018, pp. 75–79.
G. T. Papadopoulos, A. Axenopoulos, and P. Daras, “Real-time skeleton-tracking-based human action recognition using kinect data,” in MultiMedia Modeling: 20th Anniversary International Conference, MMM 2014, Dublin, Ireland, January 6-10, 2014, Proceedings, Part I 20. Springer, 2014, pp. 473–483.
K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution representation learning for human pose estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5693–5703.
Y. Yuan, R. Fu, L. Huang, W. Lin, C. Zhang, X. Chen, and J. Wang, “Hrformer: High-resolution vision transformer for dense predict,” Advances in Neural Information Processing Systems, vol. 34, pp. 7281–7293, 2021.
Z. Zhang, “Microsoft kinect sensor and its effect,” IEEE multimedia, vol. 19, no. 2, pp. 4–10, 2012.
L. Keselman, J. Iselin Woodfill, A. Grunnet-Jepsen, and A. Bhowmik, “Intel realsense stereoscopic depth cameras,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2017, pp. 1–10.
J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308.
H. Duan, Y. Zhao, K. Chen, D. Lin, and B. Dai, “Revisiting skeleton-based action recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2969–2978.
S. Yan, Y. Xiong, and D. Lin, “Spatial temporal graph convolutional networks for skeleton-based action recognition,” in Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018.
C. Plizzari, M. Cannici, and M. Matteucci, “Skeleton-based action recognition via spatial and temporal transformer networks,” Computer Vision and Image Understanding, vol. 208, p. 103219, 2021.
A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, “Ntu rgb+ d: A large scale dataset for 3d human activity analysis,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1010–1019.
J. Liu, A. Shahroudy, M. Perez, G. Wang, L.-Y. Duan, and A. C. Kot, “Ntu rgb+ d 120: A large-scale benchmark for 3d human activity understanding,” IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 10, pp. 2684–2701, 2019.
K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.
D. S. Wishart, A. Guo, E. Oler, F. Wang, A. Anjum, H. Peters, R. Dizon, Z. Sayeeda, S. Tian, B. L. Lee et al., “Hmdb 5.0: the human metabolome database for 2022,” Nucleic acids research, vol. 50, no. D1, pp. D622–D631, 2022.
I. Y. Jung, “A review of privacy-preserving human and human activity recognition,” International Journal on Smart Sensing and Intelligent Systems, vol. 13, no. 1, pp. 1–13, 2020.
P. Gupta, D. Sharma, and R. K. Sarvadevabhatla, “Syntactically guided generative embeddings for zero-shot skeleton action recognition,” in 2021 IEEE International Conference on Image Processing (ICIP). IEEE, 2021, pp. 439–443.
Y. Zhou, W. Qiang, A. Rao, N. Lin, B. Su, and J. Wang, “Zero-shot skeleton-based action recognition via mutual information estimation and maximization,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 5302–5310.
F. Pourpanah, M. Abdar, Y. Luo, X. Zhou, R. Wang, C. P. Lim, X.-Z. Wang, and Q. J. Wu, “A review of generalized zero-shot learning methods,” IEEE transactions on pattern analysis and machine intelligence, 2022.
K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” Advances in neural information processing systems, vol. 27, 2014.
B. D. Lucas and T. Kanade, “An iterative image registration technique with an application to stereo vision,” in IJCAI’81: 7th international joint conference on Artificial intelligence, vol. 2, 1981, pp. 674–679.
E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, “Flownet 2.0: Evolution of optical flow estimation with deep networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2462–2470.
L. Sevilla-Lara, Y. Liao, F. Güney, V. Jampani, A. Geiger, and M. J. Black, “On the integration of optical flow and action recognition,” in Pattern Recognition: 40th German Conference, GCPR 2018, Stuttgart, Germany, October 9-12, 2018, Proceedings 40. Springer, 2019, pp. 281–297.
Q. Ke, M. Bennamoun, S. An, F. Sohel, and F. Boussaid, “A new representation of skeleton sequences for 3d action recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 3288–3297.
L. Shi, Y. Zhang, J. Cheng, and H. Lu, “Two-stream adaptive graph convolutional networks for skeleton-based action recognition,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 12 026–12 035.
P. Zhang, C. Lan, W. Zeng, J. Xing, J. Xue, and N. Zheng, “Semantics-guided neural networks for efficient skeleton-based human action recognition,” in proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 1112–1121.
K. Cheng, Y. Zhang, X. He, W. Chen, J. Cheng, and H. Lu, “Skeleton-based action recognition with shift graph convolutional network,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 183–192.
Z. Han, Z. Fu, and J. Yang, “Learning the redundancy-free features for generalized zero-shot object recognition,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 12 865–12 874.
E. Kodirov, T. Xiang, Z. Fu, and S. Gong, “Unsupervised domain adaptation for zero-shot learning,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 2452–2460.
B. Brattoli, J. Tighe, F. Zhdanov, P. Perona, and K. Chalupka, “Rethinking zero-shot video classification: End-to-end training for realistic applications,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4613–4623.
X. Xu, T. M. Hospedales, and S. Gong, “Multi-task zero-shot action recognition with prioritised data augmentation,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14. Springer, 2016, pp. 343–359.
Y.-H. Hubert Tsai, L.-K. Huang, and R. Salakhutdinov, “Learning robust visual-semantic embeddings,” in Proceedings of the IEEE International conference on Computer Vision, 2017, pp. 3571–3580.
E. Schonfeld, S. Ebrahimi, S. Sinha, T. Darrell, and Z. Akata, “Generalizedzero-and few-shot learning via aligned variational autoencoders,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 8247–8255.
Y. Atzmon and G. Chechik, “Adaptive confidence smoothing for generalized zero-shot learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11 671–11 680.
Y. Bengio, “Deep learning of representations: Looking forward,” in International conference on statistical language and speech processing. Springer, 2013, pp. 1–37.
Z. Chen, Y. Luo, R. Qiu, S. Wang, Z. Huang, J. Li, and Z. Zhang, “Semantics disentangling for generalized zero-shot learning,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 8712–8720.
Y. Gao, C. Tang, and J. Lv, “Cluster-based contrastive disentangling for generalized zero-shot learning,” arXiv preprint arXiv:2203.02648, 2022.
D. Shao, Y. Zhao, B. Dai, and D. Lin, “Finegym: A hierarchical video dataset for fine-grained action understanding,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2616–2625.
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763.
M. Wray, D. Larlus, G. Csurka, and D. Damen, “Fine-grained action retrieval through multiple parts-of-speech embeddings,” in IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
M.-Z. Li, Z. Jia, Z. Zhang, Z. Ma, and L. Wang, “Multi-semantic fusion model for generalized zero-shot skeleton-based action recognition,” in International Conference on Image and Graphics. Springer, 2023, pp. 68–80.
L. Batina, B. Gierlichs, E. Prouff, M. Rivain, F.-X. Standaert, and N. Veyrat-Charvillon, “Mutual information analysis: a comprehensive study,” Journal of Cryptology, vol. 24, no. 2, pp. 269–291, 2011.
-
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/95986
dc.description.abstract: 在廣義零樣本基於骨架的動作識別中,現有方法通過特定模態的投影網絡學習骨架特徵和語義嵌入的共享潛在空間。然而,動作識別數據集中,骨架序列因樣本可變而類別標籤為恆定的非對稱性帶來了學習共享潛在空間時的重大挑戰。為了解決這一問題,我們引入了SMARTEN,一種基於對抗學習的特徵解耦方法,從骨架特徵中分離語義相關和無關的潛在變量,以更好地與語義嵌入對齊。利用特定模態的變分自編碼器(VAE)結合交叉重構損失,SMARTEN將語義相關的骨架特徵與語義嵌入對齊。我們的方法在零樣本和廣義零樣本動作識別中設立了新基準,在NTU RGB+D 60、NTU RGB+D 120和FineGym 99等數據集上顯示出顯著的改進。 [zh_TW]
dc.description.abstract: In generalized zero-shot skeleton-based action recognition, existing approaches learn a shared latent space of skeleton features and semantic embeddings via modality-specific projection networks. However, action recognition datasets are asymmetric: skeleton sequences vary from sample to sample while class labels stay constant, which makes learning such a shared space difficult. To address this, we introduce SMARTEN, an adversarial feature-disentanglement method that separates semantic-related from semantic-unrelated latent variables in skeleton features for better alignment with semantic embeddings. Using modality-specific variational autoencoders (VAEs) coupled with a cross-reconstruction loss, SMARTEN aligns the semantic-related skeleton features with the semantic embeddings. Our approach sets a new state of the art in zero-shot and generalized zero-shot action recognition, with significant improvements on benchmark datasets such as NTU RGB+D 60, NTU RGB+D 120, and FineGym 99. [en]
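The cross-modal alignment described in the abstract lends itself to a small illustration. Below is a minimal PyTorch sketch of two modality-specific VAEs trained with within-modality and cross-modality reconstruction losses, in the spirit of the aligned-VAE line of work cited in the references (Schonfeld et al.). All module names, feature dimensions, and loss weights here are hypothetical, this is not the thesis' released code, and the adversarial total-correlation penalty listed in the table of contents is omitted.

```python
# A minimal sketch, assuming 256-d skeleton features, 768-d text
# embeddings, and 64-d latents; not the thesis implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAEncoder(nn.Module):
    """Encodes an input feature vector into a diagonal-Gaussian latent."""
    def __init__(self, in_dim: int, latent_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)

    def forward(self, x):
        h = self.net(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return z, mu, logvar

def kld(mu, logvar):
    """KL divergence from N(mu, sigma^2) to N(0, I), averaged over the batch."""
    return (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)).mean()

# Skeleton branch: two encoders disentangle semantic-related (z_r) and
# semantic-unrelated (z_u) latents; the text branch has a single latent
# that shares the dimensionality of z_r so the two can be swapped.
skel_dim, text_dim, zr_dim, zu_dim = 256, 768, 64, 64  # assumed sizes
enc_skel_r = VAEncoder(skel_dim, zr_dim)
enc_skel_u = VAEncoder(skel_dim, zu_dim)
enc_text = VAEncoder(text_dim, zr_dim)
dec_skel = nn.Sequential(nn.Linear(zr_dim + zu_dim, 512), nn.ReLU(),
                         nn.Linear(512, skel_dim))
dec_text = nn.Sequential(nn.Linear(zr_dim, 512), nn.ReLU(),
                         nn.Linear(512, text_dim))

def loss_step(f_skel, f_text):
    z_r, mu_r, lv_r = enc_skel_r(f_skel)
    z_u, mu_u, lv_u = enc_skel_u(f_skel)
    z_t, mu_t, lv_t = enc_text(f_text)
    # Within-modality reconstruction
    rec = (F.mse_loss(dec_skel(torch.cat([z_r, z_u], dim=1)), f_skel)
           + F.mse_loss(dec_text(z_t), f_text))
    # Cross-reconstruction: swap the semantic-related latent across
    # modalities, pushing z_r (skeleton) and z_t (text) into a shared space.
    cross = (F.mse_loss(dec_skel(torch.cat([z_t, z_u], dim=1)), f_skel)
             + F.mse_loss(dec_text(z_r), f_text))
    kl = kld(mu_r, lv_r) + kld(mu_u, lv_u) + kld(mu_t, lv_t)
    return rec + cross + 0.1 * kl  # 0.1 is an arbitrary KL weight

# One loss evaluation on random stand-in features (batch of 8)
loss = loss_step(torch.randn(8, skel_dim), torch.randn(8, text_dim))
loss.backward()
```

Under such a setup, zero-shot classification could proceed by encoding each unseen class label with enc_text and assigning a test skeleton to the class whose latent is nearest to its z_r; the thesis' exact classifier may differ.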
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-09-25T16:28:46Z. No. of bitstreams: 0 [en]
dc.description.provenance: Made available in DSpace on 2024-09-25T16:28:46Z (GMT). No. of bitstreams: 0 [en]
dc.description.tableofcontents:
口試委員審定書 (Oral Examination Committee Certification) i
Acknowledgments ii
摘要 (Chinese Abstract) iii
Abstract iv
List of Figures viii
List of Tables ix
1 Introduction 1
1.1 Background 1
1.2 Motivation 2
1.3 Proposed Method 3
1.4 Thesis Organization 4
2 Related Work 5
2.1 Action Recognition 5
2.1.1 RGB Videos 5
2.1.2 Optical Flows 6
2.1.3 Human Skeleton Representation 6
2.2 Zero-Shot Action Recognition 7
2.3 Generalized Zero-Shot Action Recognition 7
2.4 Feature Disentanglement in Generalized Zero-Shot Learning 8
3 Problem Definition 10
3.1 Zero-Shot Skeleton-Based Action Recognition 11
3.2 Generalized Zero-Shot Skeleton-Based Action Recognition 11
4 Methodology 12
4.1 Feature Extraction 12
4.2 Generative Cross-Modal Alignment and Disentanglement Module 14
4.2.1 Latent Representation 14
4.2.2 Feature Disentanglement and VAE Architecture 15
4.2.3 Adversarial Total Correlation Penalty 16
4.2.4 Cross-Alignment 16
4.3 Zero-Shot Classification 17
4.4 Generalized Zero-Shot Classification 18
5 Experiments 19
5.1 Evaluation Protocols 19
5.1.1 Datasets 19
5.1.2 Skeleton and Text Feature Extractors 20
5.1.3 Evaluation Metrics 20
5.2 Comparative Evaluation with State-of-the-Art Models 21
5.2.1 Zero-Shot Learning Results 22
5.2.2 Generalized Zero-Shot Learning Results 22
5.3 Assessment of Model with Rich Textual Descriptions 23
5.3.1 Zero-Shot Learning Analysis 24
5.3.2 Generalized Zero-Shot Learning Analysis 25
5.4 Analysis of Robustness Across Diverse Skeleton Feature Extractors 25
5.5 Robustness Evaluation on Datasets with Non-Standard Class Labels 27
5.6 Ablation Study 28
6 Conclusion 30
6.1 Contribution 30
6.2 Limitation and Future Work 31
References 32
dc.language.iso: en
dc.title: 語義對齊與特徵解離於廣義零樣本動作識別 [zh_TW]
dc.title: SMARTEN: Semantic Alignment Through Feature Disentanglement For Generalized Zero-Shot Skeleton-Based Action Recognition [en]
dc.type: Thesis
dc.date.schoolyear: 113-1
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 楊智淵;王鈺強;陳駿丞 [zh_TW]
dc.contributor.oralexamcommittee: Chih-Yuan Yang;Yu-Chiang Frank Wang;Jun-Cheng Chen [en]
dc.subject.keyword: 零樣本學習, 語義對齊, 特徵解耦, 基於骨架之動作識別 [zh_TW]
dc.subject.keyword: Zero-Shot Learning, Semantic Alignment, Feature Disentanglement, Skeleton-based Action Recognition [en]
dc.relation.page: 38
dc.identifier.doi: 10.6342/NTU202401280
dc.rights.note: 同意授權(全球公開) (Authorization granted; worldwide open access)
dc.date.accepted: 2024-09-03
dc.contributor.author-college: 電機資訊學院
dc.contributor.author-dept: 資訊網路與多媒體研究所
Appears in collections: 資訊網路與多媒體研究所

Files in this item:
File | Size | Format
ntu-113-1.pdf | 3.08 MB | Adobe PDF


Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
