NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74089
Full metadata record (DC field: value [language])
dc.contributor.advisor: 徐宏民 (Winston H. Hsu)
dc.contributor.author: Sebastian Agethen [en]
dc.contributor.author: 蔡格昇 [zh_TW]
dc.date.accessioned: 2021-06-17T08:19:31Z
dc.date.available: 2019-08-18
dc.date.copyright: 2019-08-18
dc.date.issued: 2019
dc.date.submitted: 2019-08-14
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74089
dc.description.abstract: With more and more portable devices able to browse the internet and to record and play back video at any time, video has become ubiquitous in our daily lives, and a wide range of applications requires an effective understanding of video content. Compared with image-based learning, however, video is inherently spatio-temporal. This brings several challenges. First, the amount of data available from video is a multiple of that in image-based learning, so training models on video becomes far more time-consuming; at the same time, the amount of irrelevant data also grows, so strategies for extracting the relevant content from video are needed. Second, beyond spatial reasoning, a video-based application also needs a temporal model to untangle the causal relations in a video. This dissertation addresses these core challenges. For the situation in which data continuously becomes available over time, we propose an expert-based incremental learning method that allows deep models to be trained efficiently. Next, observing that the kernel size is key to causal reasoning and sequence learning across consecutive frames, we propose a multi-kernel learning method that improves the correctness of the results while keeping learning efficient. In addition, by discarding what is irrelevant, we enable deep models to learn abstract action behavior from video content; to this end, we propose an attention mechanism based on extracting fine-grained features. We choose human action prediction as the application and evaluate it with a human pose-attention mask model. Finally, we study the application of video-based learning to the camera localization task and propose an improved regularization scheme to raise localization accuracy. [zh_TW]
dc.description.abstract: Videos have become omnipresent in our daily lives with the rise in popularity of portable devices that can browse the internet and record and play back video at any time. A wide range of applications requires an effective understanding of video content. In comparison to image-based learning, however, video is inherently spatio-temporal. This brings several challenges. First, the quantity of available data is a multiple of that in image-based learning, so training such models becomes far more time-consuming; at the same time, the amount of irrelevant data also grows, and strategies to focus on relevant content are needed. Second, in addition to spatial reasoning, a video-based application requires the ability to understand and causally connect temporal patterns. We address these central challenges in this dissertation. We propose an expert-based incremental learning method that allows deep models to be trained efficiently when data continuously becomes available over time. Following that, we identify kernel size as critical to the ability to causally relate temporally spaced phenomena in sequence learning, and propose a multi-kernel solution that improves results while remaining efficient. Furthermore, we enable deep models to learn abstract actions by discarding irrelevant video content; to this end, we propose an attention mechanism that extracts fine-grained features. We choose human action anticipation as the application and evaluate the approach with human pose-based attention masks. Finally, we investigate the application of video-based learning to camera localization and orientation estimation, and propose a regularization scheme to improve results. [en]
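As a concrete illustration of the multi-kernel convolutional LSTM and the pose-based attention mentioned in the abstract, the snippet below sketches one plausible reading of the two ideas. It is a minimal sketch only, assuming PyTorch as the framework (not necessarily the one used in the dissertation); the names MultiKernelConvLSTMCell, pose_attention_pool, kernel_sizes and fuse are inventions of this example, not identifiers from the thesis, and the actual layer layout in Chapters 4 and 5 may differ.

```python
# Minimal sketch, assuming PyTorch; class/function names are illustrative
# and not taken from the dissertation.
import torch
import torch.nn as nn


class MultiKernelConvLSTMCell(nn.Module):
    """ConvLSTM cell whose gates are computed from several kernel sizes at once."""

    def __init__(self, in_ch, hid_ch, kernel_sizes=(1, 3, 5)):
        super().__init__()
        # One convolution per kernel size, each mapping [x; h] to the four gates.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)
            for k in kernel_sizes
        ])
        # 1x1 convolution fusing the concatenated branch responses.
        self.fuse = nn.Conv2d(len(kernel_sizes) * 4 * hid_ch, 4 * hid_ch, kernel_size=1)
        self.hid_ch = hid_ch

    def forward(self, x, state):
        h, c = state                                     # hidden and cell state
        z = torch.cat([x, h], dim=1)                     # [B, in+hid, H, W]
        gates = self.fuse(torch.cat([b(z) for b in self.branches], dim=1))
        i, f, o, g = torch.split(gates, self.hid_ch, dim=1)
        i, f, o = i.sigmoid(), f.sigmoid(), o.sigmoid()
        c_next = f * c + i * g.tanh()                    # cell update
        h_next = o * c_next.tanh()                       # hidden update
        return h_next, c_next


def pose_attention_pool(features, pose_heatmap, eps=1e-6):
    """Reweight feature maps [B, C, H, W] by a pose heatmap [B, 1, H, W] and pool."""
    w = pose_heatmap / (pose_heatmap.sum(dim=(2, 3), keepdim=True) + eps)
    return (features * w).sum(dim=(2, 3))               # [B, C] attended descriptor
```

A cell like this would be unrolled over the frame sequence, and the pose-weighted pooling stands in for the fine-grained, pose-attended actor features used for action anticipation in the abstract's description; both are sketches under the stated assumptions rather than the dissertation's exact architecture.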
dc.description.provenance: Made available in DSpace on 2021-06-17T08:19:31Z (GMT). No. of bitstreams: 1; ntu-108-D01944015-1.pdf: 12059475 bytes, checksum: a315c18b84d5aed7013b3232414acc0a (MD5); Previous issue date: 2019 [en]
dc.description.tableofcontents:
Acknowledgements
摘要 (Abstract in Chinese)
Abstract
1 Introduction
2 Literature Review
3 Mediated Experts for Deep Convolutional Networks
3.1 Introduction
3.2 Related Work
3.3 Proposed Method
3.3.1 Simple Branching model
3.3.2 Branched experts with early stopping
3.3.3 Mediator
3.4 Evaluation
3.4.1 Superclass construction
3.4.2 Results on Hierarchical set
4 Deep Multi-Kernel Convolutional LSTM Networks and an Attention-Based Mechanism for Videos
4.1 Introduction
4.2 Related Work
4.3 Methodology
4.3.1 Long Short-Term Memory
4.3.2 Problem of Kernel Sizes
4.3.3 Concatenation of multiple kernels
4.3.4 Attention-based masking
4.4 Evaluation
4.4.1 Quantitative results
4.4.2 Analysis
4.4.3 Qualitative results
5 Anticipation of Human Actions with Pose-based Fine-grained Representations
5.1 Introduction
5.2 Related Work
5.3 Proposed Method
5.3.1 Pose-based attention generation
5.3.2 Actor feature extraction
5.3.3 Decoding future representation
5.4 Evaluation
5.4.1 Dataset: Charades
5.4.2 Quantitative results
5.4.3 Qualitative results
6 FishNet: A Camera Localizer using Deep Recurrent Networks
6.1 Introduction
6.2 Related Work
6.3 Model Architecture
6.3.1 Long Short-Term Memory
6.3.2 Camera Pose Regression with LSTM
6.3.3 Network Loss Function
6.4 Experimental Results
6.4.1 Comparison with deep learning approach
6.4.2 Comparison with 2D-to-3D descriptor matching
6.4.3 Comparison of Fisheye and Perspective camera
6.5 Qualitative Analysis
7 Conclusion and Future Work
7.1 Future Work
Bibliography
dc.language.iso: en
dc.title: 深度影片解析及其在動作識別中的應用 [zh_TW]
dc.title: Deep Video Understanding and its Applications in Action Recognition [en]
dc.type: Thesis
dc.date.schoolyear: 107-2
dc.description.degree: 博士 (doctoral)
dc.contributor.oralexamcommittee: 歐陽明, 李明穗, 陳文進, 余能豪, 葉梅珍
dc.subject.keyword: 深度學習, 序列學習, 動作辨識, 動作預測, 多專家模型, 增量學習 [zh_TW]
dc.subject.keyword: deep learning, sequence learning, action recognition, action prediction, mixture of experts, incremental learning [en]
dc.relation.page: 85
dc.identifier.doi: 10.6342/NTU201900690
dc.rights.note: 有償授權 (paid authorization)
dc.date.accepted: 2019-08-14
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) [zh_TW]
dc.contributor.author-dept: 資訊網路與多媒體研究所 (Graduate Institute of Networking and Multimedia) [zh_TW]
Appears in collections: 資訊網路與多媒體研究所 (Graduate Institute of Networking and Multimedia)

Files in this item:
File: ntu-108-1.pdf (currently not authorized for public access)
Size: 11.78 MB
Format: Adobe PDF