Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/76404
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 傅立成(Li-Chen Fu) | |
dc.contributor.author | Tso-Hsin Yeh | en |
dc.contributor.author | 葉佐新 | zh_TW |
dc.date.accessioned | 2021-07-09T15:51:53Z | - |
dc.date.available | 2023-08-21 | |
dc.date.copyright | 2018-08-21 | |
dc.date.issued | 2018 | |
dc.date.submitted | 2018-08-10 | |
dc.identifier.citation | [1] Ivan Laptev and Tony Lindeberg. Space-time interest points. In Proceedings of the Ninth IEEE International Conference on Computer Vision (ICCV), pages 432–439. IEEE, 2003.
[2] Amlan Kar, Nishant Rai, Karan Sikka, and Gaurav Sharma. AdaScan: Adaptive scan pooling in deep convolutional neural networks for human action recognition in videos. In CVPR, 2017.
[3] X. Yan, S. Hu, and Y. Ye. Multi-task clustering of human actions by sharing information. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4049–4057, July 2017.
[4] Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, and Bryan Russell. ActionVLAD: Learning spatio-temporal aggregation for action classification. In CVPR, 2017.
[5] Y. Wang, M. Long, J. Wang, and P. S. Yu. Spatiotemporal pyramid network for video action recognition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2097–2106, July 2017.
[6] C. Feichtenhofer, A. Pinz, and R. P. Wildes. Spatiotemporal multiplier networks for video action recognition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7445–7454, July 2017.
[7] Z. Lan, Y. Zhu, A. G. Hauptmann, and S. Newsam. Deep local video feature for action recognition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1219–1225, July 2017.
[8] Christoph Feichtenhofer, Axel Pinz, and Richard Wildes. Spatiotemporal residual networks for video action recognition. In Advances in Neural Information Processing Systems (NIPS), pages 3468–3476, 2016.
[9] I. C. Duta, B. Ionescu, K. Aizawa, and N. Sebe. Spatio-temporal vector of locally max pooled features for action recognition in videos. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3205–3214, July 2017.
[10] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 4489–4497, Dec 2015.
[11] Huijuan Xu, Abir Das, and Kate Saenko. R-C3D: Region convolutional 3D network for temporal activity detection. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[12] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? arXiv preprint arXiv:1711.09577, 2017.
[13] Joe Yue-Hei Ng, Jonghyun Choi, Jan Neumann, and Larry S. Davis. ActionFlowNet: Learning motion representation for action recognition. CoRR, abs/1612.03052, 2016.
[14] A. Burton and J. Radford. Thinking in Perspective: Critical Essays in the Study of Thought Processes. Psychology in Progress. Methuen, 1978.
[15] D. H. Warren and E. R. Strelow. Electronic Spatial Sensing for the Blind: Contributions from Perception, Rehabilitation, and Computer Vision. NATO Science Series E. Springer Netherlands, 1985.
[16] Jean-Yves Bouguet. Pyramidal implementation of the Lucas-Kanade feature tracker. Intel Corporation, Microprocessor Research Labs, 2000.
[17] Gunnar Farnebäck. Two-frame motion estimation based on polynomial expansion. In Proceedings of the 13th Scandinavian Conference on Image Analysis (SCIA'03), pages 363–370, Berlin, Heidelberg, 2003. Springer-Verlag.
[18] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philipp Häusser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), pages 2758–2766, Washington, DC, USA, 2015. IEEE Computer Society.
[19] Pierre Baldi. Autoencoders, unsupervised learning and deep architectures. In Proceedings of the 2011 International Conference on Unsupervised and Transfer Learning Workshop - Volume 27 (UTLW'11), pages 37–50. JMLR.org, 2011.
[20] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[21] Anurag Ranjan and Michael Black. Optical flow estimation using a spatial pyramid network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Piscataway, NJ, USA, July 2017. IEEE.
[22] Yi Zhu, Zhenzhong Lan, Shawn Newsam, and Alexander G. Hauptmann. Guided optical flow learning. arXiv preprint arXiv:1702.02295, 2017.
[23] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 568–576. Curran Associates, Inc., 2014.
[24] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional two-stream network fusion for video action recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[25] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3304–3311, June 2010.
[26] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, June 2014.
[27] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778. IEEE Computer Society, 2016.
[28] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV (4), volume 9908 of Lecture Notes in Computer Science, pages 630–645. Springer, 2016.
[29] Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, and Cordelia Schmid. Joint learning of object and action detectors. In IEEE International Conference on Computer Vision (ICCV), Venice, Italy, October 22-29, 2017, pages 2001–2010, 2017.
[30] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML'15), pages 448–456. JMLR.org, 2015.
[31] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[32] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[33] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In European Conference on Computer Vision (ECCV), Part IV, LNCS 7577, pages 611–625. Springer-Verlag, October 2012.
[34] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/76404 | - |
dc.description.abstract | 在本篇中,我們提出了一個嶄新的深度學習架構來做動作辨識。近期的深度學習研究議題中,動作辨識是一個越來越重要的領域,深度學習方法已經被廣泛地運用且有能力產生泛化的模型。目前大部分存在的方法不是使用 Two-Stream 就是使用 3D ConvNet 的方法:前者使用了單張彩色影像與多張疊在一起的光流影像當作架構的輸入,而後者則是將多張疊在一起的彩色影像當作輸入,但會花較大時間及記憶體上的代價。本篇提出的 ResFlow 使用一個光流的數據庫以得到預訓練的模型,並且在動作的數據庫上作微調處理,此模型可視為一個提取高維度特徵的模組來做動作辨識。在第一階段中,使用光流數據庫當作一個預先學習的基礎,整合空間時間的特徵透過自動編碼機的架構會從中間的高維度空間中被提取出來。在微調階段中,透過影片中分解出的影像可以得到一組區域性整合空間時間的特徵,並且利用設計的序列式機制,可以得到每一個區域性整合空間時間特徵的信心分數,而利用這個信心分數可以有效率地得到全域性整合空間時間的特徵,而這全域性整合空間時間的特徵可被拿來做動作辨識。 | zh_TW |
dc.description.abstract | In this thesis, we propose a brand-new architecture for action recognition. Action recognition has recently become a rising topic in the computer vision field; deep learning methods have been widely used and are capable of producing generic models. Most existing methods adopt either a two-stream design, taking an RGB image and a stack of optical flow frames as inputs, or C3D, concatenating several RGB images as input, both of which are costly in time and memory. ResFlow, the proposed method, is pre-trained on an optical flow dataset, FlyingChairs, and fine-tuned on action datasets, UCF101 and HMDB51, to serve as a high-level feature extractor. With the optical flow pre-training in the first stage, spatiotemporal features are encoded in the latent high-dimensional space in the middle of the autoencoder architecture. In the fine-tuning stage, the spatiotemporal features extracted from a set of frames of a video clip are given confidence scores by a designed Sequential Mechanism. The Sequential Mechanism takes the ordered features as input, assigns a confidence score to each, and aggregates the sequential features into a condensed feature that is leveraged for action recognition. This design uses only RGB images as input yet encodes temporal information through optical flow pre-training, and it efficiently aggregates local spatiotemporal features into a global spatiotemporal feature for action recognition. | en |
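The aggregation step the abstract describes (score each local spatiotemporal feature, then combine the scores into one global feature) can be illustrated with a minimal NumPy sketch. This is not the thesis's implementation: the linear scoring function here is hypothetical and randomly initialized, whereas the actual Sequential Mechanism learns its scores end to end; the sketch only shows the shape of a confidence-weighted (softmax) pooling over per-frame features.

```python
import numpy as np

def sequential_pooling(features):
    """Aggregate per-frame spatiotemporal features into one global feature.

    features: array of shape (T, D), one D-dimensional local feature per frame.
    Each feature receives a scalar confidence score (hypothetical linear scorer);
    the global feature is the softmax-weighted sum of the local features.
    """
    rng = np.random.default_rng(0)
    w = rng.standard_normal(features.shape[1])  # illustrative scoring weights
    scores = features @ w                       # (T,) raw confidence per frame
    weights = np.exp(scores - scores.max())     # numerically stable softmax
    weights /= weights.sum()                    # weights over the T frames sum to 1
    return weights @ features                   # (D,) condensed global feature

# 16 frames of 128-dimensional local features -> one 128-dimensional global feature
local = np.random.default_rng(1).standard_normal((16, 128))
global_feat = sequential_pooling(local)
print(global_feat.shape)  # (128,)
```

The softmax weighting keeps the pooled feature a convex combination of the local features, so frames the scorer deems uninformative contribute little to the condensed representation.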
dc.description.provenance | Made available in DSpace on 2021-07-09T15:51:53Z (GMT). No. of bitstreams: 1 ntu-107-R05921061-1.pdf: 24899455 bytes, checksum: 5c5778b93075b6d2397ffd94c15f76a2 (MD5) Previous issue date: 2018 | en |
dc.description.tableofcontents | Acknowledgements i
中文摘要 (Abstract in Chinese) ii
Abstract iii
Contents iv
List of Figures vii
List of Tables x
1 Introduction 1
1.1 Motivation 1
1.2 Literature Review 2
1.2.1 Action Understanding 2
1.2.2 Action Recognition 3
1.2.3 Applications based on Convolutional Neural Networks 4
1.3 Challenge 4
1.4 Contribution 5
1.5 Thesis Organization 6
2 Preliminary 7
2.1 Optical Flow 7
2.1.1 Traditional Methods 8
2.1.2 Autoencoders 9
2.2 Action Recognition 11
2.2.1 Two-Stream 12
2.2.2 3D ConvNet 13
2.3 Joint Learning 16
2.3.1 Object and Action 16
2.3.2 Optical Flow and Action 17
3 Optical Flow Estimation 19
3.1 Design Concept 19
3.2 Architecture 21
3.2.1 Encoder 21
3.2.2 Decoder 22
3.3 Refinement Network 23
3.4 Remarks 25
4 Action Recognition 26
4.1 Optical Flow to Action Recognition 26
4.2 ResFlow 27
4.3 Sequentially Pooling Mechanism 29
4.4 Implementation Details 32
5 Experiment 34
5.1 Optical Flow Dataset 34
5.1.1 FlyingChairs Dataset 35
5.1.2 Sintel Dataset 37
5.2 Action Recognition Dataset 38
5.2.1 UCF101 Dataset 38
5.2.2 HMDB51 Dataset 41
5.2.3 Discussion 41
5.3 Optical Flow Estimation 41
5.3.1 Comparison 42
5.3.2 Refinement 43
5.3.3 Sintel Dataset 45
5.3.4 Remark 48
5.4 Action Recognition 48
5.4.1 Spatiotemporal Feature 49
5.4.2 Comparison 50
5.4.3 Multitasking 51
5.4.4 Remark 52
6 Conclusion 56
7 Reference 57 | |
dc.language.iso | zh-TW | |
dc.title | 以全域性空間時間特徵輔以序列式生成機制完成多工之動作辨識及產生光流影像 | zh_TW |
dc.title | Using Global Spatiotemporal Features with Sequentially Pooling Mechanism for Multi-tasking of Action Recognition and Optical Flow Estimation | en |
dc.type | Thesis | |
dc.date.schoolyear | 106-2 | |
dc.description.degree | 碩士 | |
dc.contributor.oralexamcommittee | 黃正民,廖弘源,王鈺強,范欽雄 | |
dc.subject.keyword | 動作辨識, 光流, 序列式機制 | zh_TW |
dc.subject.keyword | action recognition, optical flow estimation, sequential mechanism | en |
dc.relation.page | 61 | |
dc.identifier.doi | 10.6342/NTU201802828 | |
dc.rights.note | 同意授權(全球公開) | |
dc.date.accepted | 2018-08-10 | |
dc.contributor.author-college | 電機資訊學院 | zh_TW |
dc.contributor.author-dept | 電機工程學研究所 | zh_TW |
dc.date.embargo-lift | 2023-08-21 | - |
Appears in Collections: | 電機工程學系 (Department of Electrical Engineering)
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-107-R05921061-1.pdf | 24.32 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.