Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/17898
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 陳良基(Liang-Gee Chen) | |
dc.contributor.author | Chia-Jung Hsu | en |
dc.contributor.author | 許嘉容 | zh_TW |
dc.date.accessioned | 2021-06-08T00:45:43Z | - |
dc.date.copyright | 2015-07-31 | |
dc.date.issued | 2015 | |
dc.date.submitted | 2015-07-31 | |
dc.identifier.citation | [1] C. Gu, P. Arbeláez, Y. Lin, K. Yu, and J. Malik, "Multi-component models for object detection," in Computer Vision–ECCV 2012, pp. 445–458, Springer, 2012.
[2] R. Shapovalov, "Object detection vs. semantic segmentation." http://computerblindness.blogspot.tw/2010/06/object-detection-vs-semantic.html, 2014. [Online; accessed 20-May-2015].
[3] C. Arthur, "Augmented reality: it's like real life, but better." http://www.theguardian.com/technology/2010/mar/21/augmented-reality-iphone-advertising, 2010. [Online; accessed 20-May-2015].
[4] S. Cangeloso, "Self-driving cars could save lives, gas." http://www.geek.com/geek-cetera/self-driving-cars-1447453/, 2011. [Online; accessed 20-May-2015].
[5] W. Japan, "Leave your chores to a robot." http://web-japan.org/trends/09_sci-tech/sci090327.html, 2009. [Online; accessed 20-May-2015].
[6] M. Munaro, F. Basso, and E. Menegatti, "Tracking people within groups with RGB-D data," in Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 2101–2107, 2012.
[7] Z. Kalal, K. Mikolajczyk, and J. Matas, "Tracking-learning-detection," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 34, no. 7, pp. 1409–1422, 2012.
[8] Y. Tu, C.-L. Zeng, C.-H. Yeh, S.-Y. Huang, T.-X. Cheng, and M. Ouhyoung, "Real-time head pose estimation using depth map for avatar control," 2011.
[9] Qualcomm, "Image targets | Vuforia developer portal." https://developer.vuforia.com/resources/dev-guide/image-targets, 2015. [Online; accessed 22-May-2015].
[10] D. Hanauer, "Ringing Rocks Park, Bucks County, Pennsylvania." http://www.davidhanauer.com/buckscounty/ringingrocks/, 2013. [Online; accessed 20-June-2013].
[11] Microsoft, "Kinect for Windows, voice, movement, gesture recognition technology." http://www.microsoft.com/en-us/kinectforwindows/, 2015. [Online; accessed 22-May-2015].
[12] Asus, "Xtion PRO LIVE - Multimedia - ASUS." http://www.asus.com/Multimedia/Xtion_PRO_LIVE/, 2015. [Online; accessed 22-May-2015].
[13] LeapMotion, "Leap Motion." https://www.leapmotion.com/, 2015. [Online; accessed 22-May-2015].
[14] Microsoft, "Xbox One | Meet Xbox One - Xbox.com." http://www.xbox.com/en-US/xbox-one?xr=shellnav, 2015. [Online; accessed 22-May-2015].
[15] V. Castaneda and N. Navab, "Time-of-flight and Kinect imaging." http://campar.in.tum.de/twiki/pub/Chair/TeachingSs11Kinect/2011-DSensors_LabCourse_Kinect.pdf, 2011. [Online; accessed 20-June-2013].
[16] A. Kolb, E. Barth, R. Koch, and R. Larsen, "Time-of-flight sensors in computer graphics," in Proc. Eurographics (State-of-the-Art Report), 2009.
[17] D. Scharstein and R. Szeliski, "High-accuracy stereo depth maps using structured light," in Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, vol. 1, pp. I-195, IEEE, 2003.
[18] R. B. Rusu, Semantic 3D Object Maps for Everyday Manipulation in Human Living Environments. PhD thesis, Technische Universität München, Munich, Germany, October 2009.
[19] M. Munaro, F. Basso, and E. Menegatti, "Tracking people within groups with RGB-D data," in Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 2101–2107, 2012.
[20] J. Aggarwal and L. Xia, "Human activity recognition from 3D data: A review," Pattern Recognition Letters, vol. 48, pp. 70–80, 2014.
[21] X. Yang and Y. Tian, "EigenJoints-based action recognition using Naive-Bayes-Nearest-Neighbor," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, pp. 14–19, IEEE, 2012.
[22] L. Xia and J. Aggarwal, "Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera," in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pp. 2834–2841, IEEE, 2013.
[23] A. W. Vieira, E. R. Nascimento, G. L. Oliveira, Z. Liu, and M. F. Campos, "STOP: Space-time occupancy patterns for 3D action recognition from depth map sequences," in Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, pp. 252–259, Springer, 2012.
[24] A. Swadzba, N. Beuter, J. Schmidt, and G. Sagerer, "Tracking objects in 6D for reconstructing static scenes," in Computer Vision and Pattern Recognition Workshops, 2008. CVPRW'08. IEEE Computer Society Conference on, pp. 1–7, IEEE, 2008.
[25] O. Oreifej and Z. Liu, "HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences," in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pp. 716–723, IEEE, 2013.
[26] W. Li, Z. Zhang, and Z. Liu, "Action recognition based on a bag of 3D points," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pp. 9–14, IEEE, 2010.
[27] Wikipedia, "Outline of object recognition." http://en.wikipedia.org/wiki/Outline_of_object_recognition, 2014. [Online; accessed 20-May-2015].
[28] J. Xiao and L. Quan, "Multiple view semantic segmentation for street view images," in Computer Vision, 2009 IEEE 12th International Conference on, pp. 686–693, IEEE, 2009.
[29] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla, "Segmentation and recognition using structure from motion point clouds," in Computer Vision–ECCV 2008, pp. 44–57, Springer, 2008.
[30] G. Heitz and D. Koller, "Learning spatial context: Using stuff to find things," in Computer Vision–ECCV 2008, pp. 30–43, Springer, 2008.
[31] Wikipedia, "Augmented reality." http://en.wikipedia.org/wiki/Augmented_reality, 2012. [Online; accessed 20-May-2015].
[32] WillowGarage, "PR2 overview." http://www.willowgarage.com/pages/pr2/overview, 2013. [Online; accessed 20-May-2015].
[33] S. Schroeder, "This is what Google's self-driving car 'sees' as it makes a turn." http://mashable.com/2013/05/03/google-self-driving-car-sees/, 2013. [Online; accessed 20-May-2015].
[34] Wikipedia, "Google driverless car." http://en.wikipedia.org/wiki/Google_driverless_car, 2013. [Online; accessed 20-May-2015].
[35] S. Edelstein, "BMW ActiveAssist system lets self-driving cars get sideways and keep you on the road." http://www.digitaltrends.com/cars/bmw-activeassist-introduced-at-ces-2014/#!EkGS4, 2014. [Online; accessed 20-May-2015].
[36] P. C. Ng and S. Henikoff, "SIFT: Predicting amino acid changes that affect protein function," Nucleic Acids Research, vol. 31, no. 13, pp. 3812–3814, 2003.
[37] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in Computer Vision–ECCV 2006, pp. 404–417, Springer, 2006.
[38] Qualcomm, "Qualcomm Vuforia." http://www.qualcomm.com/solutions/augmented-reality, 2015. [Online; accessed 21-May-2015].
[39] Creative, "Creative interactive gesture camera developer kit." http://us.creative.com/p/web-cameras/creative-senz3d, 2015. [Online; accessed 22-May-2015].
[40] FaceShift, "faceshift | face animation software: we put markerless motion capture at every desk." http://www.faceshift.com/, 2015. [Online; accessed 22-May-2015].
[41] T. Weise, S. Bouaziz, H. Li, and M. Pauly, "Realtime performance-based facial animation," ACM Transactions on Graphics (Proceedings SIGGRAPH 2011), August 2011.
[42] M. Munaro, F. Basso, and E. Menegatti, "Tracking people within groups with RGB-D data," in Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 2101–2107, 2012.
[43] J. Aggarwal and M. S. Ryoo, "Human activity analysis: A review," ACM Computing Surveys (CSUR), vol. 43, no. 3, p. 16, 2011.
[44] L.-C. Chen, J.-W. Hsieh, C.-H. Chuang, C.-Y. Huang, and D.-Y. Chen, "Occluded human action analysis using dynamic manifold model," in Pattern Recognition (ICPR), 2012 21st International Conference on, pp. 1245–1248, IEEE, 2012.
[45] D. Weinland, M. Ozuysal, and P. Fua, "Making action recognition robust to occlusions and viewpoint changes," in Computer Vision–ECCV 2010, pp. 635–648, Springer, 2010.
[46] M.-Y. Chen and A. Hauptmann, "MoSIFT: Recognizing human actions in surveillance videos," 2009.
[47] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, and R. Moore, "Real-time human pose recognition in parts from single depth images," Communications of the ACM, vol. 56, no. 1, pp. 116–124, 2013.
[48] R. Poppe, "A survey on vision-based human action recognition," Image and Vision Computing, vol. 28, no. 6, pp. 976–990, 2010.
[49] Wikipedia, "Big data." http://en.wikipedia.org/wiki/Big_data, 2015. [Online; accessed 24-May-2015].
[50] Wikipedia, "Cloud computing." http://en.wikipedia.org/wiki/Cloud_computing, 2015. [Online; accessed 24-May-2015].
[51] YouTube, "Statistics of YouTube." https://www.youtube.com/yt/press/statistics.html, 2015. [Online; accessed 25-May-2015].
[52] Wikipedia, "List of most popular websites." http://en.wikipedia.org/wiki/List_of_most_popular_websites, 2015. [Online; accessed 25-May-2015].
[53] G. Johansson, "Visual motion perception," Scientific American, 1975.
[54] C. Zhang and Y. Tian, "RGB-D camera-based daily living activity recognition," Journal of Computer Vision and Image Processing, vol. 2, no. 4, p. 12, 2012.
[55] J. Wang, Z. Liu, Y. Wu, and J. Yuan, "Mining actionlet ensemble for action recognition with depth cameras," in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 1290–1297, IEEE, 2012.
[56] A. Yao, J. Gall, G. Fanelli, and L. J. Van Gool, "Does human action recognition benefit from pose estimation?," in BMVC, vol. 3, p. 6, 2011.
[57] K. Yun, J. Honorio, D. Chattopadhyay, T. L. Berg, and D. Samaras, "Two-person interaction detection using body-pose features and multiple instance learning," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, pp. 28–35, IEEE, 2012.
[58] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid, "Evaluation of local spatio-temporal features for action recognition," in BMVC 2009-British Machine Vision Conference, pp. 124–1, BMVA Press, 2009.
[59] J. J. Gibson, The perception of the visual world. 1950.
[60] J. L. Barron, D. J. Fleet, and S. S. Beauchemin, "Performance of optical flow techniques," International Journal of Computer Vision, vol. 12, no. 1, pp. 43–77, 1994.
[61] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
[62] M. Camplani and L. Salgado, "Efficient spatio-temporal hole filling strategy for Kinect depth maps," in IS&T/SPIE Electronic Imaging, pp. 82900E–82900E, International Society for Optics and Photonics, 2012.
[63] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005. 2nd Joint IEEE International Workshop on, pp. 65–72, IEEE, 2005.
[64] I. Laptev, "On space-time interest points," International Journal of Computer Vision, vol. 64, no. 2-3, pp. 107–123, 2005.
[65] S. Tang, X. Wang, X. Lv, T. X. Han, J. Keller, Z. He, M. Skubic, and S. Lao, "Histogram of oriented normal vectors for object recognition with a depth sensor," in Computer Vision–ACCV 2012, pp. 525–538, Springer, 2013.
[66] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1, pp. 886–893, IEEE, 2005.
[67] H. S. M. Coxeter, Regular Polytopes. Courier Corporation, 1973.
[68] J. W. Davis and A. F. Bobick, "The representation and recognition of human movement using temporal templates," in Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society Conference on, pp. 928–934, IEEE, 1997.
[69] B. Grünbaum, V. Klee, M. A. Perles, and G. C. Shephard, Convex Polytopes. Springer, 1967.
[70] X. Yang, C. Zhang, and Y. Tian, "Recognizing actions using depth motion maps-based histograms of oriented gradients," in Proceedings of the 20th ACM International Conference on Multimedia, pp. 1057–1060, ACM, 2012.
[71] Wikipedia, "Newton's method." https://en.wikipedia.org/?title=Newton%27s_method, 2015. [Online; accessed 20-June-2015].
[72] Wikipedia, "Fast inverse square root." https://en.wikipedia.org/wiki/Fast_inverse_square_root, 2015. [Online; accessed 20-June-2015].
[73] Wikipedia, "IEEE 754 single-precision binary floating-point format." https://en.wikipedia.org/wiki/Single-precision_floating-point_format#IEEE_754_single_precision_binary_floating-point_format:_binary32, 2015. [Online; accessed 20-June-2015]. | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/17898 | - |
dc.description.abstract | 電腦視覺的相關研究已經進行多年,並徹底地改變了每個人的生活。幸虧科技的進步,使我們進入巨量資料與智慧型裝置的時代。隨著電腦視覺的相關研究發展,相關的創新應用徹底改變了每個人的生活,使得我們的生活更加便捷與方便。電腦視覺的終極目標是發明一個智慧型機器人,使得機器人能像人類一般理解真實世界的資訊。而要達到此終極目標的第一步則是:使得機器能夠解讀動態影片背後所代表的實質意義。
人體動作辨識的應用為機器人視覺最重要的基礎之一,動態影片蘊涵時空間的資訊,隨著深度感測器的發展,在日常生活中更易取得深度資訊,且深度資訊提供更多幾何形狀的資訊,使得動作辨識相關研究更往前邁進。在這篇論文中,我們提供使用深度資訊之即時人體動作辨識系統。我們呈現即時自動切割深度影片的方法,搭配全圖的法向量累計直方圖來描述深度影片。 最後,我們呈現了一個全新並適合硬體實現的架構,包含特徵值擷取引擎以及更新累計直方圖引擎。根據實際運行時間的分析,特徵值擷取部分是最花費時間的,因此,我們使用不同化簡技巧實現了特徵值擷取,同時比較不同更新累計直方圖引擎的硬體架構,包括直接滑動累計直方圖、優化之直接滑動累計直方圖以及我們所提出的演算法。 整體的來說,我們發展出了一個使用深度資訊且可以即時辨識人體動作的系統,同時我們提出可以減少記憶體用量以及頻寬的硬體架構。 | zh_TW |
dc.description.abstract | The ultimate goal of computer vision is to help computing devices understand the real world, process visual information efficiently, and even reach a semantic understanding of scenes the way humans do. In recent years, computer vision algorithms have progressed rapidly and enabled plenty of innovative applications. For example, the intelligent environmental surveillance systems of the future will be capable of monitoring real environments, including both objects and people. With the release of the Kinect, 3D sequences have become much more accessible, pushing research forward toward this ultimate goal.
In the past few years, various methods have been proposed to solve the problem of human activity recognition from depth images. Compared with traditional 2D videos, depth sequences provide geometric information and can therefore describe scenes better. In this thesis, we aim to provide an online action recognition system using 3D data. Since depth sequences are captured with a single commodity camera, noise and occlusion are common problems. To deal with these issues, we extract histogram of oriented 4D surface normals (HON4D) features, which capture the joint shape and motion cues in a depth sequence. Moreover, we present an automatic segmentation method for online recognition of depth sequences. The overall framework is separated into two main parts: a feature extraction engine and a histogram engine. According to our run-time profiling, feature extraction is the most time-consuming part, so HON4D feature extraction is implemented with several approximation techniques while maintaining recognition performance. Furthermore, we discuss three online action recognition architectures using HON4D features, based respectively on a direct sliding window, a modified cell-based sliding window, and our proposed algorithm. In sum, we implement HON4D feature extraction to optimize the most time-consuming part of the proposed system, and we propose an online action recognition framework that, compared with other sliding-window methods, consumes less memory and bandwidth. (An illustrative sketch of the HON4D computation appears after the metadata table below.) | en
dc.description.provenance | Made available in DSpace on 2021-06-08T00:45:43Z (GMT). No. of bitstreams: 1 ntu-104-R02943039-1.pdf: 4615810 bytes, checksum: 7a892b9f5282e195d8fc32426d5e2521 (MD5) Previous issue date: 2015 | en |
dc.description.tableofcontents | Abstract xi
1 Introduction 1
1.1 Introduction 1
1.2 Applications of Computer Vision 1
1.3 RGB-D Data in Computer Vision 4
1.4 Design Considerations and Our Contributions 10
1.5 Thesis Organization 11
2 Analysis of Action Recognition Systems from 3D Data 13
2.1 Introduction 13
2.2 Challenges of Vision Based Human Action Recognition 14
2.3 Motivation of Action Recognition from 3D Data 15
2.4 Overview of Action Recognition Methods from 3D Data 18
2.4.1 Recognition from 3D silhouettes 18
2.4.2 Recognition from skeletal joints or body parts tracking 20
2.4.3 Recognition using local spatio-temporal features 22
2.4.4 Recognition using local 3D occupancy features 23
2.4.5 Recognition from 3D optical flow 25
2.5 Conclusion 26
3 Proposed Online Action Recognition System Using HON4D Features 27
3.1 Introduction 27
3.2 System Overview 29
3.3 HON4D Feature Extraction 31
3.3.1 Neutral Pose Extraction 33
3.3.2 The 4D Surface Normal 35
3.3.3 Histogram of 4D Normals 37
3.4 Recognition System 39
3.4.1 Offline Recognition 39
3.4.2 Online Recognition 39
3.5 Experiment Results 40
3.5.1 Offline Recognition 42
3.5.2 Online Recognition 45
3.6 Conclusion 50
4 Architecture Design for Online Action Recognition System 51
4.1 Introduction 51
4.2 Design Consideration 51
4.3 Proposed Architecture Design for Online Action Recognition System 54
4.3.1 System Overview 54
4.3.2 Computation of 4D surface normals 54
4.3.3 Projection 58
4.3.4 Histogram of 4D normals 58
4.4 Implementation Results 62
4.5 Conclusion 63
5 Conclusion 65
Bibliography 67 | |
dc.language.iso | en | |
dc.title | 使用深度資訊之即時人體動作辨識系統演算法開發與架構設計 | zh_TW |
dc.title | Algorithm and Architecture Design Using HON4D for Online Human Action Recognition | en |
dc.type | Thesis | |
dc.date.schoolyear | 103-2 | |
dc.description.degree | 碩士 | |
dc.contributor.oralexamcommittee | 簡韶逸(Shao-Yi Chien),黃朝宗(Chao-Tsung Huang),賴永康(Yeong-Kang Lai) | |
dc.subject.keyword | 即時人體動作辨識系統,線上辨識架構,全圖的法向量累計直方圖,機器學習 | zh_TW |
dc.subject.keyword | Online action recognition framework, Histogram of oriented 4D normals, Machine learning | en |
dc.relation.page | 75 | |
dc.rights.note | 未授權 | |
dc.date.accepted | 2015-07-31 | |
dc.contributor.author-college | 電機資訊學院 | zh_TW |
dc.contributor.author-dept | 電子工程學研究所 | zh_TW |
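As a companion to the English abstract above, here is a minimal sketch of the two core HON4D steps it mentions: computing 4D surface normals of the depth surface D(x, y, t) and accumulating them into a histogram of oriented 4D normals, following the formulation of Oreifej and Liu [25]. This is an illustrative approximation, not the thesis implementation: the `projectors` argument stands in for the quantization vectors (the HON4D paper derives them from the vertices of a regular 4D polytope [67, 69] and refines them on training data), and the function names and NumPy realization are our own.

```python
import numpy as np

def hon4d_normals(depth_seq):
    """Unit 4D surface normals of the surface z = D(x, y, t).

    depth_seq: float array of shape (T, H, W), one depth frame per time step.
    Returns an array of shape (T, H, W, 4) with components (x, y, t, z).
    """
    dD_dt, dD_dy, dD_dx = np.gradient(depth_seq.astype(np.float64))
    # The normal of z - D(x, y, t) = 0 is (-dD/dx, -dD/dy, -dD/dt, 1).
    n = np.stack([-dD_dx, -dD_dy, -dD_dt, np.ones_like(dD_dx)], axis=-1)
    # Last component is 1, so the norm is >= 1 and division is safe.
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def hon4d_histogram(normals, projectors):
    """Distribute normals over quantized 4D directions.

    projectors: (P, 4) unit 4-vectors; each normal votes for each projector
    with its rectified inner product, then the histogram is normalized.
    """
    votes = np.maximum(normals.reshape(-1, 4) @ projectors.T, 0.0)
    hist = votes.sum(axis=0)
    return hist / (hist.sum() + 1e-12)
```

The abstract also says feature extraction is sped up with several approximation techniques. The citation list points to Newton's method [71], the fast inverse square root [72], and the IEEE 754 binary32 layout [73], which suggests (our assumption, not a statement made by this record) that the normalization of the 4D normals is the operation being approximated. A sketch of that classic bit-level trick:

```python
import struct

def fast_rsqrt(x):
    """Approximate 1/sqrt(x): IEEE 754 bit hack plus one Newton-Raphson step."""
    i = struct.unpack('<I', struct.pack('<f', x))[0]
    i = 0x5f3759df - (i >> 1)              # magic-constant initial guess
    y = struct.unpack('<f', struct.pack('<I', i))[0]
    return y * (1.5 - 0.5 * x * y * y)     # one Newton step, ~0.2% max error
```

On this reading, dividing each normal by its length becomes a multiplication by `fast_rsqrt(dx*dx + dy*dy + dt*dt + 1.0)`, trading a small, bounded relative error for a much cheaper datapath, which is consistent with the thesis's stated goal of optimizing the most time-consuming stage.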
Appears in Collections: | 電子工程學研究所 (Graduate Institute of Electronics Engineering)
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-104-1.pdf (currently not authorized for public access) | 4.51 MB | Adobe PDF |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated in their copyright terms.