NTU Theses and Dissertations Repository

Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/89018
Full metadata record (DC field: value [language])
dc.contributor.advisor: 許永真 [zh_TW]
dc.contributor.advisor: Yung-jen Hsu [en]
dc.contributor.author: 金明毅 [zh_TW]
dc.contributor.author: Ming-Yi Chin [en]
dc.date.accessioned: 2023-08-16T16:47:22Z
dc.date.available: 2023-11-09
dc.date.copyright: 2023-08-16
dc.date.issued: 2023
dc.date.submitted: 2023-08-10
dc.identifier.citation:
[1] V. Choutas, P. Weinzaepfel, J. Revaud, and C. Schmid, "PoTion: Pose MoTion representation for action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7024–7033, 2018.
[2] A. Hernandez Ruiz, L. Porzi, S. Rota Bulò, and F. Moreno-Noguer, "3D CNNs on distance matrices for human action recognition," in Proceedings of the 25th ACM International Conference on Multimedia, pp. 1087–1095, 2017.
[3] J. Liu, A. Shahroudy, M. Perez, G. Wang, L.-Y. Duan, and A. C. Kot, "NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, pp. 2684–2701, 2019.
[4] W.-Y. Chen, Y.-C. Liu, Z. Kira, Y.-C. F. Wang, and J.-B. Huang, "A closer look at few-shot classification," arXiv preprint arXiv:1904.04232, 2019.
[5] J. K. Aggarwal and M. S. Ryoo, "Human activity analysis: A review," ACM Computing Surveys (CSUR), vol. 43, no. 3, pp. 1–43, 2011.
[6] M. Ziaeefard and R. Bergevin, "Semantic human activity recognition: A literature review," Pattern Recognition, vol. 48, no. 8, pp. 2329–2345, 2015.
[7] G. T. Papadopoulos, A. Axenopoulos, and P. Daras, "Real-time skeleton-tracking-based human action recognition using Kinect data," in MultiMedia Modeling: 20th Anniversary International Conference, MMM 2014, Dublin, Ireland, January 6–10, 2014, Proceedings, Part I, pp. 473–483, Springer, 2014.
[8] S. N. Paul and Y. J. Singh, "Survey on video analysis of human walking motion," International Journal of Signal Processing, Image Processing and Pattern Recognition, vol. 7, no. 3, pp. 99–122, 2014.
[9] H. Rahmani, A. Mahmood, D. Q. Huynh, and A. Mian, "HOPC: Histogram of oriented principal components of 3D pointclouds for action recognition," in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part II, pp. 742–757, Springer, 2014.
[10] J. Wan, Q. Ruan, W. Li, G. An, and R. Zhao, "3D SMoSIFT: Three-dimensional sparse motion scale invariant feature transform for activity recognition from RGB-D videos," Journal of Electronic Imaging, vol. 23, no. 2, p. 023017, 2014.
[11] D. Das Dawn and S. H. Shaikh, "A comprehensive survey of human action recognition with spatio-temporal interest point (STIP) detector," The Visual Computer, vol. 32, pp. 289–306, 2016.
[12] J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the Kinetics dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308, 2017.
[13] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," arXiv preprint arXiv:1406.2199, 2014.
[14] S. Song, C. Lan, J. Xing, W. Zeng, and J. Liu, "Spatio-temporal attention-based LSTM networks for 3D action recognition and detection," IEEE Transactions on Image Processing, vol. 27, no. 7, pp. 3459–3471, 2018.
[15] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al., "The Kinetics human action video dataset," arXiv preprint arXiv:1705.06950, 2017.
[16] D. Shao, Y. Zhao, B. Dai, and D. Lin, "FineGym: A hierarchical video dataset for fine-grained action understanding," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2616–2625, 2020.
[17] J. Snell, K. Swersky, and R. Zemel, "Prototypical networks for few-shot learning," Advances in Neural Information Processing Systems, vol. 30, 2017.
[18] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales, "Learning to compare: Relation network for few-shot learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208, 2018.
[19] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?," Advances in Neural Information Processing Systems, vol. 27, 2014.
[20] C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," in International Conference on Machine Learning, pp. 1126–1135, PMLR, 2017.
[21] G. Koch, R. Zemel, R. Salakhutdinov, et al., "Siamese neural networks for one-shot image recognition," in ICML Deep Learning Workshop, vol. 2, Lille, 2015.
[22] S. Benaim and L. Wolf, "One-shot unsupervised cross domain translation," Advances in Neural Information Processing Systems, vol. 31, 2018.
[23] K. Sun, B. Xiao, D. Liu, and J. Wang, "Deep high-resolution representation learning for human pose estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5693–5703, 2019.
[24] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788, 2016.
[25] W. Chen, C. Si, Z. Zhang, L. Wang, Z. Wang, and T. Tan, "Semantic prompt for few-shot image recognition," arXiv preprint arXiv:2303.14123, 2023.
[26] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., "Learning transferable visual models from natural language supervision," in International Conference on Machine Learning, pp. 8748–8763, PMLR, 2021.
[27] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum, "Human-level concept learning through probabilistic program induction," Science, vol. 350, no. 6266, pp. 1332–1338, 2015.
[28] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al., "Matching networks for one shot learning," Advances in Neural Information Processing Systems, vol. 29, 2016.
[29] G. S. Dhillon, P. Chaudhari, A. Ravichandran, and S. Soatto, "A baseline for few-shot image classification," arXiv preprint arXiv:1909.02729, 2019.
[30] W.-H. Li, X. Liu, and H. Bilen, "Cross-domain few-shot learning with task-specific adapters," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7161–7170, 2022.
[31] A. F. Bobick and J. W. Davis, "The recognition of human movement using temporal templates," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 3, pp. 257–267, 2001.
[32] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017.
[33] C. Feichtenhofer, H. Fan, J. Malik, and K. He, "SlowFast networks for video recognition," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211, 2019.
[34] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497, 2015.
[35] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, "A closer look at spatiotemporal convolutions for action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6450–6459, 2018.
[36] A. Newell, K. Yang, and J. Deng, "Stacked hourglass networks for human pose estimation," in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VIII, pp. 483–499, Springer, 2016.
[37] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, et al., "Deep high-resolution representation learning for visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 10, pp. 3349–3364, 2020.
[38] H. Duan, Y. Zhao, K. Chen, D. Lin, and B. Dai, "Revisiting skeleton-based action recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2969–2978, 2022.
[39] J. Lee, M. Lee, D. Lee, and S. Lee, "Hierarchically decomposed graph convolutional networks for skeleton-based action recognition," arXiv preprint arXiv:2208.10741, 2022.
[40] K. Cao, J. Ji, Z. Cao, C.-Y. Chang, and J. C. Niebles, "Few-shot video classification via temporal alignment," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10618–10627, 2020.
[41] T. Perrett, A. Masullo, T. Burghardt, M. Mirmehdi, and D. Damen, "Temporal-relational CrossTransformers for few-shot action recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 475–484, 2021.
[42] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," Advances in Neural Information Processing Systems, vol. 28, 2015.
[43] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[44] A. Thatipelli, S. Narayan, S. Khan, R. M. Anwer, F. S. Khan, and B. Ghanem, "Spatio-temporal relation modeling for few-shot action recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19958–19967, 2022.
[45] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, "NTU RGB+D: A large scale dataset for 3D human activity analysis," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1010–1019, 2016.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/89018
dc.description.abstract [zh_TW]:
行動識別是視頻理解中的關鍵領域,通常需要大量的訓練數據。為了解決這個問題,我們採用了少樣本學習方法。然而,這些方法主要設計用於同領域的場景,當應用到現實世界的跨領域情況時,會面臨挑戰。在本研究中,我們引入了一種新的數據表示方式——“軌跡(Trajectory)”,以及一個“交叉相似性注意力(CSA)塊”,它們都是基於行動識別的先驗知識,並且可以輕鬆地整合到現有的少樣本學習方法中。
“軌跡(Trajectory)”方法是一種新的骨骼數據表示方式,利用空間信息來彌補由於視頻採樣導致的時間數據損失。這種方法使我們能夠以更少的幀數和更少的計算資源,達到與使用更多幀數時相當的結果。CSA塊利用骨骼數據的獨特特性來增強空間和時間相似性的比較,從而使度量學習能夠生成更好的嵌入。
我們還將視覺提示學習整合到適應新領域的微調過程中。我們的方法不僅在開放數據集上表現出強大的性能,在現實世界的場景中(如我們實驗室於AIMS項目中收集的嬰兒行為數據)也展現了出色的表現。這突顯了它在解決標籤數據有限的實際行動識別挑戰上的實用性與潛力。
dc.description.abstract [en]:
Action recognition, a critical domain in video understanding, typically requires a substantial amount of training data. To address this, we employ few-shot learning methods. However, these methods, primarily designed for same-domain scenarios, face challenges when applied to real-world, cross-domain situations. In this study, we introduce a novel data representation, 'Trajectory', and a 'Cross-Similarity Attention (CSA) Block', both informed by prior knowledge specific to action recognition and easily integrated into existing few-shot learning methods.
The 'Trajectory' method, a new representation for skeleton data, leverages spatial information to compensate for the temporal data loss caused by video sampling. This approach achieves results comparable to those obtained with more frames, while using fewer frames and less computational resources.
The CSA Block exploits the unique characteristics of skeleton data to enhance the comparison of spatial and temporal similarities, enabling metric learning to generate better embeddings.
We also incorporate visual prompt learning for fine-tuning during adaptation to new domains. Our method demonstrates robust performance not only on open datasets but also in real-world scenarios, such as the infant action data collected in our lab's AIMS project. This underlines its practical applicability and potential for addressing real-world action recognition challenges with limited labeled data.
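
To make the two contributions described in the abstract concrete, here is a minimal sketch of a 'Trajectory'-style representation: per-frame skeleton joints are rendered as Gaussian heatmaps and the temporal axis is then collapsed into a single per-joint map. The array shapes, the time-weighted max aggregation, and all function names are illustrative assumptions, not the implementation used in the thesis.

```python
# Hedged illustration only: shapes, names, and the aggregation rule are assumptions.
import numpy as np

def joints_to_heatmaps(joints_xy, H=64, W=64, sigma=1.5):
    """Render per-frame joint coordinates (T, J, 2) into Gaussian heatmaps (T, J, H, W)."""
    T, J, _ = joints_xy.shape
    ys, xs = np.mgrid[0:H, 0:W]
    heatmaps = np.zeros((T, J, H, W), dtype=np.float32)
    for t in range(T):
        for j in range(J):
            x, y = joints_xy[t, j]
            heatmaps[t, j] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heatmaps

def heatmaps_to_trajectory(heatmaps):
    """Collapse the temporal axis into one per-joint 'trajectory' map (J, H, W).

    Later frames get larger weights, so the flattened map keeps a coarse
    ordering of the motion even though the explicit frame axis is gone.
    """
    T = heatmaps.shape[0]
    weights = np.linspace(0.2, 1.0, T).reshape(T, 1, 1, 1)
    return (heatmaps * weights).max(axis=0)

# Usage: 8 sampled frames, 17 COCO-style joints
joints = np.random.rand(8, 17, 2) * 64
trajectory = heatmaps_to_trajectory(joints_to_heatmaps(joints))
print(trajectory.shape)  # (17, 64, 64)
```

Similarly, the cross-similarity idea behind the CSA Block can be sketched as a best-match cosine similarity between a query clip and a support clip, searched over a small temporal offset window so that slight misalignment between clips does not destroy the score. Again, the feature shapes, the cosine metric, and the offset search are assumptions for illustration rather than the thesis's CSA Block.

```python
# Hedged illustration only: a toy cross-similarity score between two skeleton-feature clips.
import numpy as np

def cosine_sim(a, b, eps=1e-8):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def cross_similarity(query, support, max_offset=2):
    """query, support: (T, J, C) per-frame, per-joint feature vectors.

    For each (frame, joint) of the query, keep the best cosine similarity
    over support frames within +/- max_offset frames, then average.
    """
    T, J, _ = query.shape
    scores = np.zeros((T, J), dtype=np.float32)
    for t in range(T):
        for j in range(J):
            window = range(max(0, t - max_offset), min(T, t + max_offset + 1))
            scores[t, j] = max(cosine_sim(query[t, j], support[u, j]) for u in window)
    return float(scores.mean())  # scalar used to rank candidate support classes

q, s = np.random.randn(8, 17, 32), np.random.randn(8, 17, 32)
print(cross_similarity(q, s))
```
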
dc.description.provenance [en]: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-08-16T16:47:22Z. No. of bitstreams: 0
dc.description.provenance [en]: Made available in DSpace on 2023-08-16T16:47:22Z (GMT). No. of bitstreams: 0
dc.description.tableofcontents:
1 Introduction
  1.1 Background
  1.2 Motivation
  1.3 Proposed Method
  1.4 Outline of the Thesis
2 Literature Review
  2.1 Few-Shot Learning
    2.1.1 Initialization Based
    2.1.2 Distance Metric Learning Based
    2.1.3 Augment Dataset by Prior Knowledge
    2.1.4 Cross-Domain Few-Shot Learning
  2.2 Action Recognition
    2.2.1 RGB Frames
    2.2.2 Optical Flows
    2.2.3 Human Skeleton Data
    2.2.4 Pose MoTion Representation for Action Recognition [1]
  2.3 Few-Shot Learning on Action Recognition
3 Problem Definition
4 Methodology
  4.1 Pilot Study: Exploring the Shortcomings of Existing Few-Shot Learning Techniques in Action Recognition
  4.2 The Trajectory: A Novel Data Representation for Few-Shot Learning in Action Recognition
    4.2.1 Pose Extraction
    4.2.2 Pose to 3D Heatmap
    4.2.3 3D Heatmap to Trajectory
  4.3 Cross-Similarity Attention Block
    4.3.1 Spatio-Temporal Cross Similarity
    4.3.2 Feature Extraction
    4.3.3 Feature Aggregation
    4.3.4 Objective Function
  4.4 Cross-Domain Adaptation
    4.4.1 Fine-Tuning the Relation Module Head
    4.4.2 Parameter-Efficient Fine-Tuning
    4.4.3 Visual Prompt Tuning
5 Experiments
  5.1 Experiment Setup
    5.1.1 Datasets and Scenarios
    5.1.2 Implementation Details
    5.1.3 Evaluation Scenarios
    5.1.4 Data Augmentation
    5.1.5 Competitor Methods
    5.1.6 Evaluation Metrics
  5.2 Implementation Details
    5.2.1 Baseline Method
    5.2.2 Proposed Method
    5.2.3 Meta-Testing Phase
  5.3 Experiment Results
    5.3.1 Performance Impact of the Novel Data Representation 'Trajectory' in Few-Shot Action Recognition Tasks
    5.3.2 Performance Evaluation of the Cross-Similarity Attention Block
  5.4 Cross-Domain Adaptation: Experimental Setup and Comparison
  5.5 System-Level Comparison
  5.6 Ablation Study
    5.6.1 Investigating the Importance of Temporal Offset in the Cross-Similarity Attention Block
    5.6.2 Investigating the Importance of Spatial Offset in the Cross-Similarity Attention Block
    5.6.3 The Impact of Camera View on Few-Shot Learning in Action Recognition
    5.6.4 Real-World Scenarios
6 Conclusion
  6.1 Contribution
  6.2 Future Study
Bibliography
dc.language.iso: en
dc.subject: 視覺提示任務 [zh_TW]
dc.subject: 跨領域學習 [zh_TW]
dc.subject: 少樣本學習 [zh_TW]
dc.subject: 動作識別 [zh_TW]
dc.subject: 骨架資料 [zh_TW]
dc.subject: Few-Shot Learning [en]
dc.subject: Action Recognition [en]
dc.subject: Skeleton Data [en]
dc.subject: Visual Prompt Learning [en]
dc.subject: Cross Domain [en]
dc.title: 融入先驗知識以增強動作識別中跨領域少樣本學習能力 [zh_TW]
dc.title: Incorporating Prior Knowledge to Enhance Cross-Domain Few-Shot Learning in Action Recognition [en]
dc.type: Thesis
dc.date.schoolyear: 111-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 鄭素芳;鄭文皇;郭彥伶;陳駿丞 [zh_TW]
dc.contributor.oralexamcommittee: Suh-Fang Jeng;Wen-Huang Cheng;Yen-Ling Kuo;Jun-Cheng Chen [en]
dc.subject.keyword: 動作識別, 少樣本學習, 跨領域學習, 視覺提示任務, 骨架資料 [zh_TW]
dc.subject.keyword: Action Recognition, Few-Shot Learning, Cross Domain, Visual Prompt Learning, Skeleton Data [en]
dc.relation.page: 62
dc.identifier.doi: 10.6342/NTU202303375
dc.rights.note: 同意授權(全球公開) (authorized; open access worldwide)
dc.date.accepted: 2023-08-11
dc.contributor.author-college: 電機資訊學院
dc.contributor.author-dept: 資訊網路與多媒體研究所
Appears in collections: 資訊網路與多媒體研究所

Files in this item:
ntu-111-2.pdf (3.08 MB, Adobe PDF)