Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/56663
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 傅立成 | |
dc.contributor.author | Yen-Pin Hsu | en |
dc.contributor.author | 許彥彬 | zh_TW |
dc.date.accessioned | 2021-06-16T05:40:44Z | - |
dc.date.available | 2017-08-17 | |
dc.date.copyright | 2014-08-17 | |
dc.date.issued | 2014 | |
dc.date.submitted | 2014-08-12 | |
dc.identifier.citation | [1] P. Viola and M. J. Jones, “Robust real-time face detection,” International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.
[2] C. Huang, H. Ai, Y. Li, and S. Lao, “High-performance rotation invariant multiview face detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 29, no. 4, pp. 671–686, 2007.
[3] W. Zhang, S. Shan, W. Gao, X. Chen, and H. Zhang, “Local Gabor binary pattern histogram sequence (LGBPHS): A novel non-statistical model for face representation and recognition,” in IEEE International Conference on Computer Vision (ICCV), vol. 1, 2005, pp. 786–791.
[4] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2005, pp. 886–893.
[5] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 32, no. 9, pp. 1627–1645, 2010.
[6] P. F. Felzenszwalb and D. P. Huttenlocher, “Pictorial structures for object recognition,” International Journal of Computer Vision, vol. 61, no. 1, pp. 55–79, 2005.
[7] Y. Yang and D. Ramanan, “Articulated pose estimation with flexible mixtures-of-parts,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 1385–1392.
[8] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, and R. Moore, “Real-time human pose recognition in parts from single depth images,” Communications of the ACM, vol. 56, no. 1, pp. 116–124, 2013.
[9] OpenNI, “http://www.openni.org/.”
[10] J.-S. Tsai, Y.-P. Hsu, C. Liu, and L.-C. Fu, “An efficient part-based approach to action recognition from RGB-D video with BoW-pyramid representation,” in IEEE International Conference on Intelligent Robots and Systems (IROS), 2013, pp. 2234–2239.
[11] A. F. Bobick and J. W. Davis, “The recognition of human movement using temporal templates,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 23, no. 3, pp. 257–267, 2001.
[12] D. Weinland, R. Ronfard, and E. Boyer, “Free viewpoint action recognition using motion history volumes,” Computer Vision and Image Understanding, vol. 104, no. 2, pp. 249–257, 2006.
[13] C. Ellis, S. Z. Masood, M. F. Tappen, J. J. LaViola Jr., and R. Sukthankar, “Exploring the trade-off between accuracy and observational latency in action recognition,” International Journal of Computer Vision, vol. 101, no. 3, pp. 420–436, 2013.
[14] J. Aggarwal and M. S. Ryoo, “Human activity analysis: A review,” ACM Computing Surveys (CSUR), vol. 43, no. 3, p. 16, 2011.
[15] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, “Actions as space-time shapes,” in IEEE International Conference on Computer Vision (ICCV), vol. 2, 2005, pp. 1395–1402.
[16] A. Yilmaz and M. Shah, “Actions sketch: A novel action representation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2005, pp. 984–989.
[17] H. Wang, A. Klaser, C. Schmid, and C.-L. Liu, “Action recognition by dense trajectories,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 3169–3176.
[18] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, “Learning realistic human actions from movies,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–8.
[19] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, “Behavior recognition via sparse spatio-temporal features,” in IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005, pp. 65–72.
[20] T. Darrell and A. Pentland, “Space-time gestures,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1993, pp. 335–340.
[21] A. A. Efros, A. C. Berg, G. Mori, and J. Malik, “Recognizing action at a distance,” in IEEE International Conference on Computer Vision (ICCV), 2003, pp. 726–733.
[22] L. Xia, C.-C. Chen, and J. Aggarwal, “View invariant human action recognition using histograms of 3D joints,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2012, pp. 20–27.
[23] J. Lafferty, A. McCallum, and F. C. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in International Conference on Machine Learning (ICML), 2001, pp. 282–289.
[24] X. Sun, M. Chen, and A. Hauptmann, “Action recognition via local descriptors and holistic features,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2009, pp. 58–65.
[25] X. Sun, M. Chen, and A. Hauptmann, “Action recognition via local descriptors and holistic features,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2009, pp. 58–65.
[26] Y. Shen and H. Foroosh, “View-invariant action recognition from point triplets,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 31, no. 10, pp. 1898–1905, 2009.
[27] C. Rao, A. Yilmaz, and M. Shah, “View-invariant representation and recognition of actions,” International Journal of Computer Vision, vol. 50, no. 2, pp. 203–226, 2002.
[28] V. Parameswaran and R. Chellappa, “View invariance for human action recognition,” International Journal of Computer Vision, vol. 66, no. 1, pp. 83–101, 2006.
[29] Y. Zhang, K. Huang, Y. Huang, and T. Tan, “View-invariant action recognition using cross ratios across frames,” in IEEE International Conference on Image Processing (ICIP), 2009, pp. 3549–3552.
[30] M.-C. Roh, H.-K. Shin, and S.-W. Lee, “View-independent human action recognition with volume motion template on single stereo camera,” Pattern Recognition Letters, vol. 31, no. 7, pp. 639–647, 2010.
[31] D. Weinland, E. Boyer, and R. Ronfard, “Action recognition from arbitrary views using 3D exemplars,” in IEEE International Conference on Computer Vision (ICCV), 2007, pp. 1–7.
[32] L. Xia, C.-C. Chen, and J. Aggarwal, “View invariant human action recognition using histograms of 3D joints,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2012, pp. 20–27.
[33] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, and R. Moore, “Real-time human pose recognition in parts from single depth images,” Communications of the ACM, vol. 56, no. 1, pp. 116–124, 2013.
[34] I. N. Junejo, E. Dexter, I. Laptev, and P. Perez, “Cross-view action recognition from temporal self-similarities,” in European Conference on Computer Vision (ECCV). Springer, 2008.
[35] I. N. Junejo, E. Dexter, I. Laptev, and P. Perez, “View-independent action recognition from temporal self-similarities,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 33, no. 1, pp. 172–185, 2011.
[36] J. Wang, C. Chen, and X. Zhu, “Free viewpoint action recognition based on self-similarities,” in IEEE International Conference on Signal Processing (ICSP), vol. 2, 2012, pp. 1131–1134.
[37] J. Wang and H. Zheng, “View-robust action recognition based on temporal self-similarities and dynamic time warping,” in IEEE International Conference on Computer Science and Automation Engineering (CSAE), vol. 2, 2012, pp. 498–502.
[38] B. D. Lucas and T. Kanade, “An iterative image registration technique with an application to stereo vision,” in International Joint Conference on Artificial Intelligence (IJCAI), 1981, pp. 674–679.
[39] B. K. Horn and B. G. Schunck, “Determining optical flow,” in 1981 Technical Symposium East. International Society for Optics and Photonics, 1981, pp. 319–331.
[40] Z. S. Harris, “Distributional structure,” Word, 1954.
[41] G. Qiu, “Indexing chromatic and achromatic patterns for content-based colour image retrieval,” Pattern Recognition, vol. 35, no. 8, pp. 1675–1686, 2002.
[42] J. MacQueen, “Some methods for classification and analysis of multivariate observations,” in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, 1967, pp. 281–297.
[43] B. E. Boser, I. M. Guyon, and V. N. Vapnik, “A training algorithm for optimal margin classifiers,” in Proceedings of the Fifth Annual Workshop on Computational Learning Theory. ACM, 1992, pp. 144–152.
[44] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
[45] V. Parameswaran and R. Chellappa, “View invariance for human action recognition,” International Journal of Computer Vision, vol. 66, no. 1, pp. 83–101, 2006.
[46] Y. Shen and H. Foroosh, “View-invariant action recognition using fundamental ratios,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–6.
[47] A. S. Ogale, A. Karapurkar, and Y. Aloimonos, “View-invariant modeling and recognition of human actions using grammars,” in Dynamical Vision. Springer, 2007, pp. 115–126.
[48] A. Farhadi and M. K. Tabrizi, “Learning to recognize activities from the wrong view point,” in European Conference on Computer Vision (ECCV). Springer, 2008, pp. 154–166.
[49] H. Zhang and L. E. Parker, “4-dimensional local spatio-temporal features for human activity recognition,” in IEEE International Conference on Intelligent Robots and Systems (IROS), 2011, pp. 2044–2049.
[50] I. Laptev, “On space-time interest points,” International Journal of Computer Vision, vol. 64, no. 2-3, pp. 107–123, 2005.
[51] N. Dalal, B. Triggs, and C. Schmid, “Human detection using oriented histograms of flow and appearance,” in European Conference on Computer Vision (ECCV). Springer, 2006, pp. 428–441.
[52] M. J. Black, Y. Yacoob, and S. X. Ju, “Recognizing human motion using parameterized models of optical flow,” in Motion-Based Recognition. Springer, 1997, pp. 245–269.
[53] J. C. Niebles, H. Wang, and L. Fei-Fei, “Unsupervised learning of human action categories using spatial-temporal words,” International Journal of Computer Vision, vol. 79, no. 3, pp. 299–318, 2008.
[54] B. Ni, G. Wang, and P. Moulin, “RGBD-HuDaAct: A color-depth video database for human daily activity recognition,” in Consumer Depth Cameras for Computer Vision. Springer, 2013, pp. 193–208.
[55] H. Pirsiavash and D. Ramanan, “Detecting activities of daily living in first-person camera views,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2847–2854.
[56] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2001, pp. I-511–I-518.
[57] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, p. 27, 2011.
[58] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, “Actions as space-time shapes,” in IEEE International Conference on Computer Vision (ICCV), vol. 2, 2005, pp. 1395–1402.
[59] D. Weinland, E. Boyer, and R. Ronfard, “Action recognition from arbitrary views using 3D exemplars,” in IEEE International Conference on Computer Vision (ICCV), 2007, pp. 1–7.
[60] H. Wang, C. Yuan, W. Hu, and C. Sun, “Supervised class-specific dictionary learning for sparse modeling in action recognition,” Pattern Recognition, vol. 45, no. 11, pp. 3902–3911, 2012. | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/56663 | - |
dc.description.abstract | In recent years, action recognition has been a popular research topic in computer vision. To let the system interpret fine-grained and complex actions in the most natural, human-like way, we design the system on a vision basis. When humans recognize the body movements of others, they need not watch the performer from directly in front: as long as enough visual information is available, the action can be recognized from any viewpoint. In this thesis, our goal is therefore to build a vision-based action recognition system that is unaffected by viewpoint and can effectively distinguish human actions whenever sufficient body information is obtained.
To achieve this, we adopt the concept of self-similarity. Even for the same action, different viewpoints observe different images and thus yield different extracted features. Hence, instead of building a model directly from the extracted features as in previous work, we compute the feature distances between every pair of frames and store them in a matrix called the Self-Similarity Matrix, which we further partition into multiple sub-matrices. Each sub-matrix is then represented by our proposed Temporal-Pyramid Bag-of-Words, and an action is represented by the temporal-pyramid bags-of-words of all its sub-matrices. Finally, we use these temporal-pyramid bag-of-words vectors as input to train a Support Vector Machine, achieving view-invariant action recognition. | zh_TW
dc.description.abstract | Understanding human actions has drawn considerable attention in the field of computer vision. We choose a vision-based approach so that a computer system can understand human actions naturally. When people recognize the actions of others, the actor does not have to stand directly in front of the observer. Therefore, in this thesis, we aim to build a vision-based action recognition system that is invariant to the viewpoint.
To achieve this goal, we adopt the idea of self-similarity. When two video sequences record the same action from different camera views, the resulting appearances of the action are entirely different. Consequently, if we simply apply feature extraction to the raw videos, we end up with totally different features. Instead of extracting a spatio-temporal feature for every frame and using these feature vectors directly, our method uses the Euclidean distances between feature vectors, arranged in a Self-Similarity Matrix (SSM). To recognize the action, we describe the local structure of the SSM with a pyramid-structured bag-of-words representation and train a Support Vector Machine as the classifier. Extensive experiments have been conducted to validate the proposed action recognition system. | en
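As a concrete illustration of the first stage described in the two abstracts above, the following is a minimal sketch of how a self-similarity matrix could be computed from per-frame feature vectors and partitioned into sub-matrices. It assumes NumPy; the function names, the choice of non-overlapping square blocks, and the block size are illustrative assumptions, not the thesis's exact design.

```python
import numpy as np

def self_similarity_matrix(features: np.ndarray) -> np.ndarray:
    """Build a (T, T) self-similarity matrix (SSM) from per-frame features.

    features: (T, D) array holding one D-dimensional spatio-temporal
    descriptor (e.g., concatenated HOG/HOF) per frame.
    Entry (i, j) is the Euclidean distance between frames i and j.
    """
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, evaluated for all pairs at once
    sq = np.sum(features ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (features @ features.T)
    return np.sqrt(np.maximum(d2, 0.0))  # clamp tiny negatives from round-off

def split_into_blocks(ssm: np.ndarray, block: int) -> list[np.ndarray]:
    """Partition the SSM into non-overlapping block x block sub-matrices.
    (An assumed scheme; trailing frames that do not fill a block are dropped.)"""
    T = ssm.shape[0]
    return [ssm[i:i + block, j:j + block]
            for i in range(0, T - block + 1, block)
            for j in range(0, T - block + 1, block)]
```

Because entry (i, j) depends only on how similar frame i is to frame j, not on the raw appearance of either frame, the overall pattern of the SSM tends to remain stable when the same action is filmed from a different viewpoint, which is the property the thesis exploits.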
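A similarly hedged sketch of the recognition stage: local descriptors taken from the SSM are quantized against a k-means codebook, pooled into bag-of-words histograms over a temporal pyramid, and fed to an SVM. It uses scikit-learn, whose SVC wraps LIBSVM (the library cited as [57]); the pyramid of 2**l segments per level, the codebook size k=200, and the RBF kernel are assumptions for illustration, not parameters confirmed by the thesis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def bow_histogram(descriptors: np.ndarray, codebook: KMeans) -> np.ndarray:
    """Quantize local descriptors to their nearest codewords and return a
    normalized histogram of codeword counts."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def temporal_pyramid_bow(descriptors: np.ndarray, codebook: KMeans,
                         levels: int = 3) -> np.ndarray:
    """Concatenate BoW histograms over a temporal pyramid: level l splits
    the temporally ordered descriptors into 2**l equal segments."""
    T = len(descriptors)
    parts = []
    for l in range(levels):
        n = 2 ** l
        for s in range(n):
            seg = descriptors[s * T // n:(s + 1) * T // n]
            parts.append(bow_histogram(seg, codebook) if len(seg)
                         else np.zeros(codebook.n_clusters))
    return np.concatenate(parts)

def train(descriptor_sets: list, labels: list, k: int = 200):
    """Fit a codebook on all training descriptors, encode each video as a
    temporal-pyramid BoW vector, and train an SVM classifier."""
    codebook = KMeans(n_clusters=k, n_init=10).fit(np.vstack(descriptor_sets))
    X = np.array([temporal_pyramid_bow(d, codebook) for d in descriptor_sets])
    clf = SVC(kernel="rbf").fit(X, labels)
    return codebook, clf
```

At test time, an unseen video would be encoded with the same codebook and pyramid and classified with `clf.predict`; the pyramid preserves coarse temporal ordering that a single global histogram would discard.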
dc.description.provenance | Made available in DSpace on 2021-06-16T05:40:44Z (GMT). No. of bitstreams: 1 ntu-103-R01922124-1.pdf: 24928943 bytes, checksum: 8a3be5af22dbd10c507ad311e4f46fb0 (MD5) Previous issue date: 2014 | en |
dc.description.tableofcontents | Acknowledgements (in Chinese) i
Acknowledgements ii
Abstract (in Chinese) iii
Abstract iv
1 Introduction 1
1.1 Motivation 2
1.2 Challenges 3
1.3 Related Work 6
1.3.1 Action Recognition 7
1.3.2 Dealing with Perspective of Camera View 8
1.4 Contribution 9
1.5 Thesis Organization 9
2 Preliminaries 11
2.1 Features 11
2.1.1 Histogram of Oriented Gradient 12
2.1.2 Histogram of Optical Flow 13
2.2 Bag-of-Words Model 14
2.2.1 Codebook Generation 14
2.2.2 Histogram of Codewords 16
2.3 Support Vector Machine 17
2.3.1 Linear SVM 17
2.3.2 Soft Margin 21
2.3.3 Nonlinear SVM 22
2.4 System Overview 22
3 Feature Extraction and Self-Similarity 25
3.1 Preprocessing 26
3.2 Spatio-Temporal Feature Extraction 28
3.3 Spatio-Temporal Self-Similarity Matrix 31
3.3.1 Self-Similarity 31
3.3.2 Spatio-Temporal Self-Similarity Matrix 32
3.4 Structural Stability of SSM across Views 33
4 SSM-Based Action Description and Action Recognition 36
4.1 Local Feature Descriptor 37
4.2 Temporal-Pyramid Bag-of-Words Representation 39
4.3 Action Recognition 41
4.3.1 Off-line Training 42
4.3.2 On-line Testing 43
5 Experiments 46
5.1 Experimental Setting 46
5.2 Datasets 47
5.2.1 Weizmann Dataset 47
5.2.2 IXMAS Dataset 48
5.2.3 ViData Dataset 48
5.3 Experimental Results 51
5.3.1 Temporal-Pyramid Bag-of-Words Evaluation 51
5.3.2 View-Invariant Action Recognition Performance 55
5.3.3 Action Spotting 58
5.3.4 Computational Cost Evaluation 59
6 Conclusion and Future Work 61
References 63 | |
dc.language.iso | en | |
dc.title | Online View-Invariant Action Recognition Using Depth Information and a Spatio-Temporal Matrix | zh_TW
dc.title | Online View-invariant Human Action Recognition Using RGB-D Spatio-temporal Matrix | en |
dc.type | Thesis | |
dc.date.schoolyear | 102-2 | |
dc.description.degree | Master | |
dc.contributor.oralexamcommittee | 李蔡彥,陳祝嵩,黃正民,洪一平 | |
dc.subject.keyword | Action Recognition, View-Invariant, Self-Similarity | zh_TW
dc.subject.keyword | Action Recognition, View-Invariant, Self-Similarity | en
dc.relation.page | 70 | |
dc.rights.note | Paid authorization | |
dc.date.accepted | 2014-08-12 | |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | zh_TW
dc.contributor.author-dept | Graduate Institute of Computer Science and Information Engineering | zh_TW
Appears in Collections: | Department of Computer Science and Information Engineering
Files in This Item:
File | Size | Format |
---|---|---|
ntu-103-1.pdf (currently not authorized for public access) | 24.34 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.