Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/7511
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 傅立成 | |
dc.contributor.author | Yu-Huan Yang | en |
dc.contributor.author | 楊侑寰 | zh_TW |
dc.date.accessioned | 2021-05-19T17:45:16Z | - |
dc.date.available | 2023-08-09 | |
dc.date.available | 2021-05-19T17:45:16Z | - |
dc.date.copyright | 2018-08-09 | |
dc.date.issued | 2018 | |
dc.date.submitted | 2018-08-08 | |
dc.identifier.citation | [1] S. Herath, M. Harandi, and F. Porikli, "Going deeper into action recognition: A survey," Image and Vision Computing, vol. 60, pp. 4-21, 2017.
[2] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proceedings of the International Conference on Learning Representations, 2015.
[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[4] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779-788.
[5] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431-3440.
[6] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91-99.
[7] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, pp. 221-231, 2013.
[8] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4489-4497.
[9] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Advances in Neural Information Processing Systems, 2014, pp. 568-576.
[10] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2625-2634.
[11] T.-W. Hsu, Y.-H. Yang, T.-H. Yeh, A.-S. Liu, L.-C. Fu, and Y.-C. Zeng, "Privacy free indoor action detection system using top-view depth camera based on key-poses," in IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2016, pp. 4058-4063.
[12] O. Oreifej and Z. Liu, "HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 716-723.
[13] X. Yang and Y. Tian, "Super normal vector for activity recognition using depth sequences," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 804-811.
[14] E. Ohn-Bar and M. M. Trivedi, "Joint angles similarities and HOG2 for action recognition," in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2013, pp. 465-470.
[15] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, "NTU RGB+D: A large scale dataset for 3D human activity analysis," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[16] H. Rahmani and A. Mian, "3D action recognition from novel viewpoints," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1506-1515.
[17] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb, "Learning from simulated and unsupervised images through adversarial training," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[18] G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev, and C. Schmid, "Learning from synthetic humans," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[19] W. Chen, H. Wang, Y. Li, H. Su, Z. Wang, C. Tu, D. Lischinski, D. Cohen-Or, and B. Chen, "Synthesizing training images for boosting human 3D pose estimation," in International Conference on 3D Vision (3DV), 2016, pp. 479-488.
[20] CMU Graphics Lab Motion Capture Database. Available: http://mocap.cs.cmu.edu/
[21] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan, "A theory of learning from different domains," Machine Learning, vol. 79, pp. 151-175, 2010.
[22] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672-2680.
[23] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104-3112.
[24] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, pp. 1735-1780, 1997.
[25] H. Rahmani, A. Mian, and M. Shah, "Learning a deep model for human action recognition from novel viewpoints," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, pp. 667-681, 2018.
[26] H. Rahmani and A. Mian, "Learning a non-linear knowledge transfer model for cross-view action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2458-2466.
[27] H. Rahmani, A. Mahmood, D. Huynh, and A. Mian, "Histogram of oriented principal components for cross-view action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, pp. 2430-2443, 2016.
[28] J. Wang, X. Nie, Y. Xia, Y. Wu, and S.-C. Zhu, "Cross-view action modeling, learning and recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2649-2656.
[29] A. Gupta, J. Martinez, J. J. Little, and R. J. Woodham, "3D pose from motion for cross-view action recognition via non-linear circulant temporal encoding," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 2601-2608.
[30] Z. Zhang, C. Wang, B. Xiao, W. Zhou, S. Liu, and C. Shi, "Cross-view action recognition via a continuous virtual path," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 2690-2697.
[31] J. Zheng and Z. Jiang, "Learning view-invariant sparse representations for cross-view action recognition," in IEEE International Conference on Computer Vision (ICCV), 2013, pp. 3176-3183.
[32] R. Li and T. Zickler, "Discriminative virtual views for cross-view action recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2855-2862.
[33] J. Liu, M. Shah, B. Kuipers, and S. Savarese, "Cross-view action recognition via view knowledge transfer," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 3209-3216.
[34] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, "Adversarial discriminative domain adaptation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[35] Y. Ganin and V. Lempitsky, "Unsupervised domain adaptation by backpropagation," in Proceedings of the International Conference on Machine Learning, 2015.
[36] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko, "Simultaneous deep transfer across domains and tasks," in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4068-4076.
[37] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in KDD, 1996, pp. 226-231.
[38] R. J. Campello, D. Moulavi, and J. Sander, "Density-based clustering based on hierarchical density estimates," in Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2013, pp. 160-172.
[39] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[40] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818-2826.
[41] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[42] C. Wang, Y. Wang, and A. L. Yuille, "An approach to pose-based action recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 915-922.
[43] L. Seidenari, V. Varano, S. Berretti, A. Del Bimbo, and P. Pala, "Recognizing actions from depth cameras as weakly aligned multi-part bag-of-poses," in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2013, pp. 479-485.
[44] MakeHuman: open source tool for making 3D characters. Available: http://www.makehuman.org/
[45] Blender: a 3D modelling and rendering package. Available: http://www.blender.org/
[46] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579-2605, 2008.
[47] Y.-C. Liu, W.-C. Chiu, S.-D. Wang, and Y.-C. F. Wang, "Domain-adaptive generative adversarial networks for sketch-to-photo inversion," in IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2017, pp. 1-6.
[48] A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in Proceedings of the International Conference on Machine Learning, 2013.
[49] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the International Conference on Machine Learning, 2010, pp. 807-814.
[50] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, pp. 1929-1958, 2014.
[51] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proceedings of the International Conference on Learning Representations, 2015.
[52] G. Gkioxari and J. Malik, "Finding action tubes," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 759-768.
[53] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu, "Action recognition by dense trajectories," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 3169-3176.
[54] B. Li, O. I. Camps, and M. Sznaier, "Cross-view activity recognition using hankelets," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 1362-1369.
[55] Z. Luo, B. Peng, D.-A. Huang, A. Alahi, and L. Fei-Fei, "Unsupervised learning of long-term motion dynamics for videos," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[56] Z. Cheng, L. Qin, Y. Ye, Q. Huang, and Q. Tian, "Human daily action analysis with multi-view and color-depth data," in European Conference on Computer Vision (ECCV), 2012, pp. 52-61.
[57] G. Evangelidis, G. Singh, and R. Horaud, "Skeletal quads: Human action recognition using joint quadruples," in International Conference on Pattern Recognition (ICPR), 2014, pp. 4513-4518.
[58] R. Vemulapalli, F. Arrate, and R. Chellappa, "Human action recognition by representing 3D skeletons as points in a Lie group," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 588-595.
[59] Y. Du, W. Wang, and L. Wang, "Hierarchical recurrent neural network for skeleton based action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1110-1118.
[60] Z. Huang, C. Wan, T. Probst, and L. Van Gool, "Deep learning on Lie groups for skeleton-based action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6099-6108.
[61] J. Liu, A. Shahroudy, D. Xu, A. K. Chichung, and G. Wang, "Skeleton-based action recognition using spatio-temporal LSTM network with trust gates," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[62] S. Song, C. Lan, J. Xing, W. Zeng, and J. Liu, "An end-to-end spatio-temporal attention model for human action recognition from skeleton data," in AAAI Conference on Artificial Intelligence (AAAI), 2017.
[63] L. Xia, C.-C. Chen, and J. Aggarwal, "View invariant human action recognition using histograms of 3D joints," in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2012, pp. 20-27.
[64] J. Wang, Z. Liu, and Y. Wu, "Learning actionlet ensemble for 3D human action recognition," in Human Action Recognition with Depth Cameras. Springer, 2014, pp. 11-40.
[65] A. Shahroudy, T.-T. Ng, Y. Gong, and G. Wang, "Deep multimodal feature analysis for action recognition in RGB+D videos," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[66] H. Rahmani and M. Bennamoun, "Learning action recognition model from depth and skeleton videos," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5832-5841. | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/7511 | - |
dc.description.abstract | 近年來,根據影片了解人們動作的技術獲得越來越多的關注,因為其有廣大的應用場域像是人機互動、智慧家庭、健康照顧以及監視系統。但是隨著視角的不同,人的輪廓外觀也會跟著不同,這造成了從未知的視角進行動作辨識仍然是個挑戰。在本論文中,我們學習了一個無關視角的姿勢特徵以進行跨視角動作辨識,另一方面,考慮到人們的隱私問題,我們捨棄了彩度影像而只採用深度影像當作我們系統的輸入。此提出的特徵提取模型運用深度卷積神經網路將來自不同視角的人物姿勢轉換到共享的特徵空間中,然而訓練深度模型需要龐大的多視角影像資料,人為蒐集和標注這樣的資料將會耗費不少的成本與精力,因此我們收集了一個合成的多視角姿勢資料庫,在模擬環境中我們將人體的立體幾何模型貼合到真實的動作捕捉資料上並且在虛擬環境中進行多視角的深度影像拍攝。
我們以無監督的方式在所創造的合成資料庫上進行無關視角姿勢特徵的學習,此外,為了確保從合成資料到真實資料上的模型遷移性,我們採用了領域適應的技巧去降低彼此的領域差異性。一個動作可以視為一連串的姿勢序列所組成,藉由長短期記憶網路我們可以習得動作的時序模型。在實驗的部分,我們將所提出的方法實作在兩個公開的多視角動作資料庫,其表現超越了幾個基本比較模型,並且同時超越了許多當前最好的方法。 | zh_TW |
dc.description.abstract | Human action understanding from videos has recently attracted considerable attention in computer vision because of its wide range of applications, such as human-robot interaction, smart homes, health care, and surveillance systems. Recognizing human activities from unknown viewpoints remains a challenging problem, since human shapes appear quite different when observed from different viewpoints.
In this thesis, we learn a View-Invariant Pose (VIP) feature representation for cross-view action recognition. In addition, considering privacy concerns, we adopt depth videos rather than RGB videos as the input to our system. The proposed VIP feature encoder is a deep Convolutional Neural Network (CNN) that maps human poses from different viewpoints to a shared high-level feature space. Learning such a deep model requires a large corpus of multi-view data, which is very expensive to collect and label. Therefore, we synthesize a Multi-View Pose (MVP) dataset by fitting 3D human body models to real motion capture data and then rendering depth images from multiple viewpoints in a simulated environment. The VIP feature is learned from the synthetic MVP dataset in an unsupervised way. Moreover, domain adaptation is employed to minimize the domain difference and thereby ensure that the model transfers from synthetic data to real data. An action can be considered as a sequence of poses, and its temporal progression is learned and modeled by a Long Short-Term Memory (LSTM) network. In the experiments, our method is applied to two benchmark multi-view 3D human action datasets; it outperforms several baseline models and achieves promising results compared with state-of-the-art methods. | en |
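The abstract above describes rendering depth images of fitted human models from many viewpoints to build the synthetic MVP dataset. As a rough, self-contained illustration of that multi-view depth rendering idea, here is a minimal NumPy sketch that z-buffers a 3D point set into depth maps from cameras placed on a ring around the subject. The thesis itself builds characters with MakeHuman and renders them in Blender [44][45], so everything here (the `look_at` and `render_depth` helpers, the focal length, the camera ring) is an illustrative assumption, not the actual pipeline.

```python
import numpy as np

def look_at(cam_pos, target=np.zeros(3), up=np.array([0.0, 0.0, 1.0])):
    # Rotation whose rows are the camera's right/down/forward axes in world space.
    z = target - cam_pos
    z = z / np.linalg.norm(z)                   # forward axis points at the subject
    x = np.cross(z, up); x = x / np.linalg.norm(x)
    y = np.cross(z, x)
    return np.stack([x, y, z])

def render_depth(points, cam_pos, f=365.0, w=320, h=240):
    # Pinhole-project an (N, 3) point set and keep the nearest depth per pixel.
    R = look_at(cam_pos)
    pc = (points - cam_pos) @ R.T               # world -> camera coordinates
    pc = pc[pc[:, 2] > 0.1]                     # discard points behind the camera
    u = (f * pc[:, 0] / pc[:, 2] + w / 2).astype(int)
    v = (f * pc[:, 1] / pc[:, 2] + h / 2).astype(int)
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth = np.full((h, w), np.inf)
    for ui, vi, zi in zip(u[ok], v[ok], pc[ok, 2]):
        depth[vi, ui] = min(depth[vi, ui], zi)  # z-buffer: nearest surface wins
    return depth

# One pose seen from eight virtual cameras on a circle around the subject.
pose_points = np.random.rand(5000, 3) * [0.6, 0.4, 1.8]  # stand-in for a fitted body mesh
views = []
for k in range(8):
    theta = 2 * np.pi * k / 8
    cam = np.array([3 * np.cos(theta), 3 * np.sin(theta), 1.2])
    views.append(render_depth(pose_points, cam))
```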
dc.description.provenance | Made available in DSpace on 2021-05-19T17:45:16Z (GMT). No. of bitstreams: 1 ntu-107-R05921001-1.pdf: 11569202 bytes, checksum: 18d7a183f71fa061f282be020d4a239e (MD5) Previous issue date: 2018 | en |
dc.description.tableofcontents | Committee Certification #
Acknowledgements I
Chinese Abstract II
ABSTRACT III
TABLE OF CONTENTS IV
LIST OF FIGURES VI
LIST OF TABLES X
Chapter 1 Introduction 1
1.1 Motivation 1
1.2 Literature Review 3
1.2.1 Action Recognition with Deep Neural Networks 3
1.2.2 Cross-View Action Recognition 4
1.2.3 Domain Adaptation 7
1.3 Contributions 8
1.4 Thesis Organization 8
Chapter 2 Preliminaries 10
2.1 Cluster Analysis and HDBSCAN 10
2.2 Convolutional Neural Network 14
2.2.1 Convolutional Layer 16
2.2.2 Xception Network 17
2.3 Recurrent Neural Network and Long Short-Term Memory 19
2.4 Generative Adversarial Network 21
2.5 Domain Adaptation 22
Chapter 3 Methodology 25
3.1 Synthesize a Multi-View Pose Dataset 25
3.1.1 Build a Pose Dictionary 26
3.1.2 Create 3D Human Models 29
3.1.3 Render Depth Images 31
3.2 Learn a View-Invariant Pose Feature 34
3.2.1 Unsupervised Learning 35
3.2.2 Adversarial Domain Adaptation 38
3.3 Model Temporal Information 43
Chapter 4 Experiments 45
4.1 Action Datasets 45
4.1.1 NTU RGB+D Action Recognition Dataset 45
4.1.2 UWA 3D Multi-View Activity II Dataset 47
4.2 Implementation Details 48
4.2.1 Synthesize a Multi-View Pose Dataset 49
4.2.2 Architecture Design 50
4.2.3 Training Details 51
4.3 Cross-View Pose Classification 53
4.4 Action Recognition Results 55
4.4.1 Action Recognition Pipeline 57
4.4.2 The Result of NTU RGB+D Action Recognition Dataset 58
4.4.3 The Result of UWA 3D Multi-View Activity II Dataset 59
Chapter 5 Conclusion and Future Works 62
REFERENCE 63 | |
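The methodology outlined in Chapter 3 above couples a CNN pose-feature encoder with adversarial domain adaptation (Section 3.2.2) and an LSTM over per-frame pose features (Section 3.3). The following PyTorch sketch shows one plausible shape of such a pipeline; the layer sizes, module names, and the ADDA-style discriminator loss [34] are assumptions for illustration, not the thesis's actual Xception-based architecture or training scheme.

```python
import torch
import torch.nn as nn

class VIPEncoder(nn.Module):
    # Toy stand-in for the pose-feature CNN (the thesis uses an Xception-style network [41]).
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, depth):                   # depth: (B, 1, H, W)
        return self.fc(self.conv(depth).flatten(1))

class DomainDiscriminator(nn.Module):
    # Predicts whether a pose feature came from synthetic or real depth data.
    def __init__(self, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 64), nn.LeakyReLU(0.1), nn.Linear(64, 1))

    def forward(self, feat):
        return self.net(feat)

class ActionLSTM(nn.Module):
    # Classifies an action from the sequence of per-frame pose features.
    def __init__(self, feat_dim=256, hidden=128, n_actions=60):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, feats):                   # feats: (B, T, feat_dim)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])            # classify from the last time step

# One adversarial adaptation step (GAN-style; one of several possible schemes):
encoder, disc = VIPEncoder(), DomainDiscriminator()
bce = nn.BCEWithLogitsLoss()
syn = torch.randn(8, 1, 64, 64)                 # synthetic depth crops (stand-in data)
real = torch.randn(8, 1, 64, 64)                # unlabeled real depth crops (stand-in data)
d_loss = bce(disc(encoder(syn).detach()), torch.ones(8, 1)) \
       + bce(disc(encoder(real).detach()), torch.zeros(8, 1))
g_loss = bce(disc(encoder(real)), torch.ones(8, 1))  # encoder tries to fool the discriminator
```

Under a scheme like this, the discriminator learns to separate synthetic from real features while the encoder learns to make them indistinguishable, shrinking the domain gap before the LSTM is trained on the adapted features.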
dc.language.iso | en | |
dc.title | 使用合成資料搭配領域適應學習無關視角姿勢特徵進行跨視角動作辨識 | zh_TW |
dc.title | Cross-View Action Recognition Using View-Invariant Pose Feature Learned from Synthetic Data with Domain Adaptation | en |
dc.type | Thesis | |
dc.date.schoolyear | 106-2 | |
dc.description.degree | 碩士 (Master) | |
dc.contributor.oralexamcommittee | 黃正民,王鈺強,廖弘源,范欽雄 | |
dc.subject.keyword | 動作辨識,跨視角,合成資料,領域適應 | zh_TW |
dc.subject.keyword | action recognition, cross-view, synthetic data, domain adaptation | en |
dc.relation.page | 67 | |
dc.identifier.doi | 10.6342/NTU201802626 | |
dc.rights.note | Authorized (open access worldwide) | |
dc.date.accepted | 2018-08-08 | |
dc.contributor.author-college | 電機資訊學院 | zh_TW |
dc.contributor.author-dept | 電機工程學研究所 | zh_TW |
dc.date.embargo-lift | 2023-08-09 | - |
Appears in Collections: | Department of Electrical Engineering
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-107-1.pdf | 11.3 MB | Adobe PDF | View/Open |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated by their specific license terms.