Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/1354
Full Metadata Record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 陳銘憲(Ming-Syan Chen) | |
dc.contributor.author | Chao-Lun Wu | en |
dc.contributor.author | 吳兆倫 | zh_TW |
dc.date.accessioned | 2021-05-12T09:37:03Z | - |
dc.date.available | 2018-08-18 | |
dc.date.available | 2021-05-12T09:37:03Z | - |
dc.date.copyright | 2018-08-18 | |
dc.date.issued | 2018 | |
dc.date.submitted | 2018-08-15 | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/handle/123456789/1354 | - |
dc.description.abstract | Drone technology has made breakthrough progress in recent years, with rich applications in many fields such as surveillance, search and rescue, delivery, and military operations. The goal of this research is to build a model that can detect and recognize events from a drone. This problem is difficult for two reasons. First, drone-related video data are very scarce, and it is hard to train a model that generalizes well from a small amount of data. Second, because a drone is usually far from the ground, the human actions it captures occupy only a small portion of the frame, making them hard for a model to recognize. To address these problems, we propose a two-step model: we first detect human locations with SSD, and then recognize the actions of the people in the drone footage with a model trained jointly with large-scale human action datasets under a multi-task learning framework. We evaluate our model on our own drone human-action video dataset, which contains 14 types of human actions. Experimental results show that the proposed method improves the human action recognition rate on drone videos. | zh_TW |
dc.description.abstract | Drone technology has advanced significantly during the last few years, enabling drones to be deployed in many tasks including video surveillance, search and rescue, last-mile delivery, and military operations. This great potential has attracted many researchers to study visual recognition technologies for drones, e.g. object detection in aerial images. However, there is little research on action recognition in drone videos. In this thesis, we aim to develop a real-time action detector for drones that can recognize complex human actions such as running, eating, and walking. Action recognition in drone videos is a challenging task for the following reasons. First, there is no large-scale drone action dataset, and the scarcity of training data makes learning accurate neural networks difficult. Second, the actions happen at a distance and are hard to localize. To address the second issue, we propose a multi-box multi-task network architecture for recognizing actions at a distance: the multi-box network generates human location proposals, and the action recognition network is then applied to the proposed locations to detect actions. As for the data scarcity, we attack this problem by leveraging existing large human action databases with multi-task learning. To evaluate the effectiveness of our method, we create a new drone action dataset with 138 videos and 14 different distant actions. Experimental results show that our proposed method can increase the action recognition rate on drone videos. | en |
dc.description.provenance | Made available in DSpace on 2021-05-12T09:37:03Z (GMT). No. of bitstreams: 1 ntu-107-R04942143-1.pdf: 5388941 bytes, checksum: de8603dc8515bb309db1dbb6a91aa5fe (MD5) Previous issue date: 2018 | en |
dc.description.tableofcontents | Thesis Committee Certification i
Acknowledgements ii
Chinese Abstract iii
Abstract iv
Contents vi
List of Figures viii
List of Tables x
1 Introduction 1
2 Related Work 4
2.1 Convolutional Neural Networks for Image Recognition 4
2.2 Convolutional Neural Networks for Action Recognition 5
2.3 Transfer Learning for Video Recognition 7
3 Proposed Method 8
3.1 Preliminaries 8
3.2 Training Framework 9
3.2.1 Detection Step 9
3.2.2 MTL Step 11
4 Experiment 13
4.1 Datasets 13
4.1.1 Large Scale Human Action Datasets 13
4.1.2 Drone Dataset 15
4.2 Experiment Settings 17
4.2.1 Environment 17
4.2.2 Training Details 17
4.2.3 Testing Details 20
4.3 Results 20
4.4 Discussion 24
5 Conclusion 27
Bibliography 28 | |
dc.language.iso | en | |
dc.title | Recognizing Distant Actions via a Multi-box Multi-task Learning Network | zh_TW |
dc.title | Recognizing Distant Actions via Multi-box Multi-task Networks | en |
dc.type | Thesis | |
dc.date.schoolyear | 106-2 | |
dc.description.degree | Master | |
dc.contributor.oralexamcommittee | 廖弘源,楊得年,陳怡伶,帥宏翰 | |
dc.subject.keyword | Action Recognition, Convolutional Neural Networks, Multi-task Learning | zh_TW |
dc.subject.keyword | Action Recognition, Convolutional Neural Networks, Multi-task Learning | en |
dc.relation.page | 33 | |
dc.identifier.doi | 10.6342/NTU201803559 | |
dc.rights.note | Authorized for release (open access worldwide) | |
dc.date.accepted | 2018-08-16 | |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | zh_TW |
dc.contributor.author-dept | Graduate Institute of Communication Engineering | zh_TW |
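The two-step method described in the abstract (first detect people, then classify the action inside each detected region) can be sketched as follows. This is a minimal illustrative sketch, not the thesis implementation: `detect_people` and `classify_action` are hypothetical stubs standing in for the SSD-style multi-box detector and the multi-task action network, and the small integer grid substitutes for a real video frame.

```python
# Sketch of a two-step distant-action pipeline: detect person boxes,
# crop each box out of the frame, then classify the action per crop.
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixel coordinates

def detect_people(frame: List[List[int]]) -> List[Box]:
    """Stub detector: returns fixed boxes. A real system would run an
    SSD-style multi-box network here to propose human locations."""
    return [(0, 0, 2, 2), (1, 1, 3, 3)]

def crop(frame: List[List[int]], box: Box) -> List[List[int]]:
    """Extract the sub-region of the frame covered by a box."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in frame[y1:y2]]

def classify_action(patch: List[List[int]]) -> str:
    """Stub classifier: a real system would run the multi-task action
    recognition network on the cropped region (and its temporal context)."""
    return "walking" if sum(map(sum, patch)) > 4 else "standing"

def recognize_actions(frame: List[List[int]]) -> List[Tuple[Box, str]]:
    """Two-step inference: detection step, then per-proposal recognition."""
    return [(box, classify_action(crop(frame, box)))
            for box in detect_people(frame)]

if __name__ == "__main__":
    frame = [[1, 0, 1, 0],
             [0, 1, 0, 1],
             [1, 0, 1, 0],
             [0, 1, 0, 1]]
    for box, action in recognize_actions(frame):
        print(box, action)
```

Separating detection from recognition in this way lets the recognition network operate on small, person-centered crops, which is the abstract's answer to actions occupying only a tiny fraction of an aerial frame.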
Appears in Collections: | Graduate Institute of Communication Engineering
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-107-1.pdf | 5.26 MB | Adobe PDF | View/Open |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated in the item's license terms.