NTU Theses and Dissertations Repository

Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/89018
Full metadata record (DC field: value [language])
dc.contributor.advisor: 許永真 [zh_TW]
dc.contributor.advisor: Yung-jen Hsu [en]
dc.contributor.author: 金明毅 [zh_TW]
dc.contributor.author: Ming-Yi Chin [en]
dc.date.accessioned: 2023-08-16T16:47:22Z
dc.date.available: 2023-11-09
dc.date.copyright: 2023-08-16
dc.date.issued: 2023
dc.date.submitted: 2023-08-10
dc.identifier.citation:
[1] V. Choutas, P. Weinzaepfel, J. Revaud, and C. Schmid, "PoTion: Pose MoTion representation for action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7024–7033, 2018.
[2] A. Hernandez Ruiz, L. Porzi, S. Rota Bulò, and F. Moreno-Noguer, "3D CNNs on distance matrices for human action recognition," in Proceedings of the 25th ACM International Conference on Multimedia, pp. 1087–1095, 2017.
[3] J. Liu, A. Shahroudy, M. Perez, G. Wang, L.-Y. Duan, and A. C. Kot, "NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, pp. 2684–2701, 2019.
[4] W.-Y. Chen, Y.-C. Liu, Z. Kira, Y.-C. F. Wang, and J.-B. Huang, "A closer look at few-shot classification," arXiv preprint arXiv:1904.04232, 2019.
[5] J. K. Aggarwal and M. S. Ryoo, "Human activity analysis: A review," ACM Computing Surveys (CSUR), vol. 43, no. 3, pp. 1–43, 2011.
[6] M. Ziaeefard and R. Bergevin, "Semantic human activity recognition: A literature review," Pattern Recognition, vol. 48, no. 8, pp. 2329–2345, 2015.
[7] G. T. Papadopoulos, A. Axenopoulos, and P. Daras, "Real-time skeleton-tracking-based human action recognition using Kinect data," in MultiMedia Modeling: 20th Anniversary International Conference, MMM 2014, Dublin, Ireland, January 6–10, 2014, Proceedings, Part I, pp. 473–483, Springer, 2014.
[8] S. N. Paul and Y. J. Singh, "Survey on video analysis of human walking motion," International Journal of Signal Processing, Image Processing and Pattern Recognition, vol. 7, no. 3, pp. 99–122, 2014.
[9] H. Rahmani, A. Mahmood, D. Q. Huynh, and A. Mian, "HOPC: Histogram of oriented principal components of 3D pointclouds for action recognition," in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part II, pp. 742–757, Springer, 2014.
[10] J. Wan, Q. Ruan, W. Li, G. An, and R. Zhao, "3D SMoSIFT: Three-dimensional sparse motion scale invariant feature transform for activity recognition from RGB-D videos," Journal of Electronic Imaging, vol. 23, no. 2, p. 023017, 2014.
[11] D. Das Dawn and S. H. Shaikh, "A comprehensive survey of human action recognition with spatio-temporal interest point (STIP) detector," The Visual Computer, vol. 32, pp. 289–306, 2016.
[12] J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the Kinetics dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308, 2017.
[13] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," arXiv preprint arXiv:1406.2199, 2014.
[14] S. Song, C. Lan, J. Xing, W. Zeng, and J. Liu, "Spatio-temporal attention-based LSTM networks for 3D action recognition and detection," IEEE Transactions on Image Processing, vol. 27, no. 7, pp. 3459–3471, 2018.
[15] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al., "The Kinetics human action video dataset," arXiv preprint arXiv:1705.06950, 2017.
[16] D. Shao, Y. Zhao, B. Dai, and D. Lin, "FineGym: A hierarchical video dataset for fine-grained action understanding," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2616–2625, 2020.
[17] J. Snell, K. Swersky, and R. Zemel, "Prototypical networks for few-shot learning," Advances in Neural Information Processing Systems, vol. 30, 2017.
[18] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales, "Learning to compare: Relation network for few-shot learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208, 2018.
[19] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?," Advances in Neural Information Processing Systems, vol. 27, 2014.
[20] C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," in International Conference on Machine Learning, pp. 1126–1135, PMLR, 2017.
[21] G. Koch, R. Zemel, R. Salakhutdinov, et al., "Siamese neural networks for one-shot image recognition," in ICML Deep Learning Workshop, vol. 2, Lille, 2015.
[22] S. Benaim and L. Wolf, "One-shot unsupervised cross domain translation," Advances in Neural Information Processing Systems, vol. 31, 2018.
[23] K. Sun, B. Xiao, D. Liu, and J. Wang, "Deep high-resolution representation learning for human pose estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5693–5703, 2019.
[24] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788, 2016.
[25] W. Chen, C. Si, Z. Zhang, L. Wang, Z. Wang, and T. Tan, "Semantic prompt for few-shot image recognition," arXiv preprint arXiv:2303.14123, 2023.
[26] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., "Learning transferable visual models from natural language supervision," in International Conference on Machine Learning, pp. 8748–8763, PMLR, 2021.
[27] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum, "Human-level concept learning through probabilistic program induction," Science, vol. 350, no. 6266, pp. 1332–1338, 2015.
[28] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al., "Matching networks for one shot learning," Advances in Neural Information Processing Systems, vol. 29, 2016.
[29] G. S. Dhillon, P. Chaudhari, A. Ravichandran, and S. Soatto, "A baseline for few-shot image classification," arXiv preprint arXiv:1909.02729, 2019.
[30] W.-H. Li, X. Liu, and H. Bilen, "Cross-domain few-shot learning with task-specific adapters," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7161–7170, 2022.
[31] A. F. Bobick and J. W. Davis, "The recognition of human movement using temporal templates," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 3, pp. 257–267, 2001.
[32] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017.
[33] C. Feichtenhofer, H. Fan, J. Malik, and K. He, "SlowFast networks for video recognition," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211, 2019.
[34] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497, 2015.
[35] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, "A closer look at spatiotemporal convolutions for action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6450–6459, 2018.
[36] A. Newell, K. Yang, and J. Deng, "Stacked hourglass networks for human pose estimation," in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VIII, pp. 483–499, Springer, 2016.
[37] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, et al., "Deep high-resolution representation learning for visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 10, pp. 3349–3364, 2020.
[38] H. Duan, Y. Zhao, K. Chen, D. Lin, and B. Dai, "Revisiting skeleton-based action recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2969–2978, 2022.
[39] J. Lee, M. Lee, D. Lee, and S. Lee, "Hierarchically decomposed graph convolutional networks for skeleton-based action recognition," arXiv preprint arXiv:2208.10741, 2022.
[40] K. Cao, J. Ji, Z. Cao, C.-Y. Chang, and J. C. Niebles, "Few-shot video classification via temporal alignment," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10618–10627, 2020.
[41] T. Perrett, A. Masullo, T. Burghardt, M. Mirmehdi, and D. Damen, "Temporal-relational CrossTransformers for few-shot action recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 475–484, 2021.
[42] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," Advances in Neural Information Processing Systems, vol. 28, 2015.
[43] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[44] A. Thatipelli, S. Narayan, S. Khan, R. M. Anwer, F. S. Khan, and B. Ghanem, "Spatio-temporal relation modeling for few-shot action recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19958–19967, 2022.
[45] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, "NTU RGB+D: A large scale dataset for 3D human activity analysis," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1010–1019, 2016.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/89018
dc.description.abstract [zh_TW]:
行動識別是視頻理解中的關鍵領域,通常需要大量的訓練數據。為了解決這個問題,我們採用了少樣本學習方法。然而,這些方法主要設計用於同領域的場景,當應用到現實世界的跨領域情況時,會面臨挑戰。在本研究中,我們引入了一種新的數據表示方式——“軌跡(Trajectory)”,以及一個“交叉相似性注意力(CSA)塊”,它們都是基於行動識別的先驗知識,並且可以輕鬆地整合到現有的少樣本學習方法中。
“軌跡(Trajectory)”方法是一種新的骨骼數據表示方式,利用空間信息來彌補由於視頻採樣導致的時間數據損失。這種方法使我們能夠以更少的幀數和更少的計算資源,達到與使用更多幀數時相當的結果。CSA塊利用骨骼數據的獨特特性來增強空間和時間相似性的比較,從而使度量學習能夠生成更好的嵌入。
我們還將視覺提示學習整合到適應新領域的微調過程中。我們的方法不僅在開放數據集上表現出強大的性能,在現實世界的場景中(如我們實驗室於AIMS項目中收集的嬰兒行為數據)也展現了出色的表現。這突顯了它在解決標籤數據有限的實際行動識別挑戰上的實用性與潛力。
dc.description.abstract [en]:
Action recognition, a critical domain in video understanding, typically requires a substantial amount of training data. To address this, we employ few-shot learning methods. However, these methods, primarily designed for same-domain scenarios, face challenges when applied to real-world, cross-domain situations. In this study, we introduce a novel data representation, 'Trajectory', and a 'Cross-Similarity Attention (CSA) Block', both informed by prior knowledge specific to action recognition and easily integrated into existing few-shot learning methods.
The 'Trajectory' method, a new representation for skeleton data, leverages spatial information to compensate for the temporal data loss caused by video sampling. This approach achieves results comparable to those obtained with more frames, while using fewer frames and less computational resources.
The CSA Block exploits the unique characteristics of skeleton data to enhance the comparison of spatial and temporal similarities, enabling metric learning to generate better embeddings.
We also incorporate visual prompt learning for fine-tuning during adaptation to new domains. Our method demonstrates robust performance not only on open datasets but also in real-world scenarios, such as the infant action data collected in our lab's AIMS project. This underlines its practical applicability and potential for addressing real-world action recognition challenges with limited labeled data.
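
To make the two contributions described in the abstract concrete, here is a minimal sketch of a 'Trajectory'-style representation: per-frame skeleton joints are rendered as Gaussian heatmaps and the temporal axis is then collapsed into a single per-joint map. The array shapes, the time-weighted max aggregation, and all function names are illustrative assumptions, not the implementation used in the thesis.

```python
# Hedged illustration only: shapes, names, and the aggregation rule are assumptions.
import numpy as np

def joints_to_heatmaps(joints_xy, H=64, W=64, sigma=1.5):
    """Render per-frame joint coordinates (T, J, 2) into Gaussian heatmaps (T, J, H, W)."""
    T, J, _ = joints_xy.shape
    ys, xs = np.mgrid[0:H, 0:W]
    heatmaps = np.zeros((T, J, H, W), dtype=np.float32)
    for t in range(T):
        for j in range(J):
            x, y = joints_xy[t, j]
            heatmaps[t, j] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heatmaps

def heatmaps_to_trajectory(heatmaps):
    """Collapse the temporal axis into one per-joint 'trajectory' map (J, H, W).

    Later frames get larger weights, so the flattened map keeps a coarse
    ordering of the motion even though the explicit frame axis is gone.
    """
    T = heatmaps.shape[0]
    weights = np.linspace(0.2, 1.0, T).reshape(T, 1, 1, 1)
    return (heatmaps * weights).max(axis=0)

# Usage: 8 sampled frames, 17 COCO-style joints
joints = np.random.rand(8, 17, 2) * 64
trajectory = heatmaps_to_trajectory(joints_to_heatmaps(joints))
print(trajectory.shape)  # (17, 64, 64)
```

Similarly, the cross-similarity idea behind the CSA Block can be sketched as a best-match cosine similarity between a query clip and a support clip, searched over a small temporal offset window so that slight misalignment between clips does not destroy the score. Again, the feature shapes, the cosine metric, and the offset search are assumptions for illustration rather than the thesis's CSA Block.

```python
# Hedged illustration only: a toy cross-similarity score between two skeleton-feature clips.
import numpy as np

def cosine_sim(a, b, eps=1e-8):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def cross_similarity(query, support, max_offset=2):
    """query, support: (T, J, C) per-frame, per-joint feature vectors.

    For each (frame, joint) of the query, keep the best cosine similarity
    over support frames within +/- max_offset frames, then average.
    """
    T, J, _ = query.shape
    scores = np.zeros((T, J), dtype=np.float32)
    for t in range(T):
        for j in range(J):
            window = range(max(0, t - max_offset), min(T, t + max_offset + 1))
            scores[t, j] = max(cosine_sim(query[t, j], support[u, j]) for u in window)
    return float(scores.mean())  # scalar used to rank candidate support classes

q, s = np.random.randn(8, 17, 32), np.random.randn(8, 17, 32)
print(cross_similarity(q, s))
```
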
dc.description.provenance [en]: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-08-16T16:47:22Z. No. of bitstreams: 0
dc.description.provenance [en]: Made available in DSpace on 2023-08-16T16:47:22Z (GMT). No. of bitstreams: 0
dc.description.tableofcontents:
1 Introduction
  1.1 Background
  1.2 Motivation
  1.3 Proposed Method
  1.4 Outline of the Thesis
2 Literature Review
  2.1 Few-Shot Learning
    2.1.1 Initialization Based
    2.1.2 Distance Metric Learning Based
    2.1.3 Augment Dataset by Prior Knowledge
    2.1.4 Cross-Domain Few-Shot Learning
  2.2 Action Recognition
    2.2.1 RGB Frames
    2.2.2 Optical Flows
    2.2.3 Human Skeleton Data
    2.2.4 Pose MoTion Representation for Action Recognition [1]
  2.3 Few-Shot Learning on Action Recognition
3 Problem Definition
4 Methodology
  4.1 Pilot Study: Exploring the Shortcomings of Existing Few-Shot Learning Techniques in Action Recognition
  4.2 The Trajectory: A Novel Data Representation for Few-Shot Learning in Action Recognition
    4.2.1 Pose Extraction
    4.2.2 Pose to 3D Heatmap
    4.2.3 3D Heatmap to Trajectory
  4.3 Cross-Similarity Attention Block
    4.3.1 Spatio-Temporal Cross Similarity
    4.3.2 Feature Extraction
    4.3.3 Feature Aggregation
    4.3.4 Objective Function
  4.4 Cross-Domain Adaptation
    4.4.1 Fine-Tuning the Relation Module Head
    4.4.2 Parameter-Efficient Fine-Tuning
    4.4.3 Visual Prompt Tuning
5 Experiments
  5.1 Experiment Setup
    5.1.1 Datasets and Scenarios
    5.1.2 Implementation Details
    5.1.3 Evaluation Scenarios
    5.1.4 Data Augmentation
    5.1.5 Competitor Methods
    5.1.6 Evaluation Metrics
  5.2 Implementation Details
    5.2.1 Baseline Method
    5.2.2 Proposed Method
    5.2.3 Meta-Testing Phase
  5.3 Experiment Results
    5.3.1 Performance Impact of the Novel Data Representation 'Trajectory' in Few-Shot Action Recognition Tasks
    5.3.2 Performance Evaluation of the Cross-Similarity Attention Block
  5.4 Cross-Domain Adaptation: Experimental Setup and Comparison
  5.5 System-Level Comparison
  5.6 Ablation Study
    5.6.1 Investigating the Importance of Temporal Offset in the Cross-Similarity Attention Block
    5.6.2 Investigating the Importance of Spatial Offset in the Cross-Similarity Attention Block
    5.6.3 The Impact of Camera View on Few-Shot Learning in Action Recognition
    5.6.4 Real-World Scenarios
6 Conclusion
  6.1 Contribution
  6.2 Future Study
Bibliography
dc.language.iso: en
dc.subject: 視覺提示任務 [zh_TW]
dc.subject: 跨領域學習 [zh_TW]
dc.subject: 少樣本學習 [zh_TW]
dc.subject: 動作識別 [zh_TW]
dc.subject: 骨架資料 [zh_TW]
dc.subject: Few-Shot Learning [en]
dc.subject: Action Recognition [en]
dc.subject: Skeleton Data [en]
dc.subject: Visual Prompt Learning [en]
dc.subject: Cross Domain [en]
dc.title: 融入先驗知識以增強動作識別中跨領域少樣本學習能力 [zh_TW]
dc.title: Incorporating Prior Knowledge to Enhance Cross-Domain Few-Shot Learning in Action Recognition [en]
dc.type: Thesis
dc.date.schoolyear: 111-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 鄭素芳;鄭文皇;郭彥伶;陳駿丞 [zh_TW]
dc.contributor.oralexamcommittee: Suh-Fang Jeng;Wen-Huang Cheng;Yen-Ling Kuo;Jun-Cheng Chen [en]
dc.subject.keyword: 動作識別, 少樣本學習, 跨領域學習, 視覺提示任務, 骨架資料 [zh_TW]
dc.subject.keyword: Action Recognition, Few-Shot Learning, Cross Domain, Visual Prompt Learning, Skeleton Data [en]
dc.relation.page: 62
dc.identifier.doi: 10.6342/NTU202303375
dc.rights.note: 同意授權(全球公開) (authorized; open access worldwide)
dc.date.accepted: 2023-08-11
dc.contributor.author-college: 電機資訊學院
dc.contributor.author-dept: 資訊網路與多媒體研究所
Appears in collections: 資訊網路與多媒體研究所

Files in this item:
ntu-111-2.pdf (3.08 MB, Adobe PDF)