Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74146
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 徐宏民(Winston H. Hsu) | |
dc.contributor.author | Hu-Cheng Lee | en |
dc.contributor.author | 李胡丞 | zh_TW |
dc.date.accessioned | 2021-06-17T08:21:48Z | - |
dc.date.available | 2022-08-21 | |
dc.date.copyright | 2019-08-21 | |
dc.date.issued | 2019 | |
dc.date.submitted | 2019-08-13 | |
dc.identifier.citation | [1] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan. YouTube-8M: A large-scale video classification benchmark. arXiv, 2016.
[2] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In CVPR, 2015.
[3] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, 2017.
[4] J. Carreira et al. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
[5] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
[6] C. Feichtenhofer, A. Pinz, and R. Wildes. Spatiotemporal residual networks for video action recognition. In NIPS, 2016.
[7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[8] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In CVPR, 2018.
[9] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In CVPR, 2010.
[10] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[11] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The Kinetics human action video dataset. arXiv, 2017.
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[13] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In ICCV, 2011.
[14] T.-Y. Lin, A. RoyChowdhury, and S. Maji. Bilinear CNN models for fine-grained visual recognition. In ICCV, 2015.
[15] D. G. Lowe. Object recognition from local scale-invariant features. In ICCV, 1999.
[16] C.-Y. Ma, A. Kadav, I. Melvin, Z. Kira, G. AlRegib, and H. Peter Graf. Attend and interact: Higher-order object interactions for video understanding. In CVPR, 2018.
[17] M. Monfort, B. Zhou, S. A. Bargal, A. Andonian, T. Yan, K. Ramakrishnan, L. Brown, Q. Fan, D. Gutfreund, C. Vondrick, et al. Moments in Time dataset: One million videos for event understanding. arXiv, 2018.
[18] F. Perronnin, J. Sánchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In ECCV, 2010.
[19] M. Rohrbach et al. A database for fine-grained activity detection of cooking activities. In CVPR, 2012.
[20] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
[21] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[22] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human action classes from videos in the wild. arXiv, 2012.
[23] L. Sun et al. Human action recognition using factorized spatio-temporal convolutional networks. In ICCV, 2015.
[24] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
[25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
[26] L. Wang et al. Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, 2015.
[27] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
[28] X. Wang et al. Actions ~ transformations. In CVPR, 2016.
[29] H. Zhang, T. Xu, M. Elhoseiny, X. Huang, S. Zhang, A. Elgammal, and D. Metaxas. SPDA-CNN: Unifying semantic part detection and abstraction for fine-grained recognition. In CVPR, 2016.
[30] N. Zhang, E. Shelhamer, Y. Gao, and T. Darrell. Fine-grained pose prediction, normalization, and recognition. arXiv, 2015. | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74146 | - |
dc.description.abstract | Convolutional neural networks have recently made great progress in video action recognition. However, prior work focuses only on a video's coarse-grained appearance, ignoring the similarity between different actions and the diversity within the same action. From the failure cases of current action recognition methods, we observe that misclassified actions are easily confused with one another because of their visual similarity. We therefore target this specific problem: fine-grained action recognition. It poses two challenges: high variation within an action class and high similarity between different action classes. In this work, we propose a multi-stream bilinear convolutional model to address fine-grained action recognition. Our model uses the video's overall appearance to extract salient features from the whole video, and fine-grained cross-modality interaction to capture the relationships between modalities. Applying the model to HMDB51 and a subset of Kinetics, we improve fine-grained recognition accuracy and reduce the confusion rate, demonstrating that our model achieves a more complete understanding of fine-grained action recognition. | zh_TW |
dc.description.abstract | Recent studies have demonstrated the effectiveness of convolutional neural networks for action recognition. However, previous works only focus on coarse-grained appearance and ignore the similarity between action classes and diversity within the same action class. Our motivation stems from the observation that some failure cases of existing approaches are easily confused with each other. Towards this end, we target at a promising direction -- Fine-Grained Action Recognition. The challenges of fine-grained action recognition are high intra-class variation and low inter-class variation. In this paper, we propose Multi-Stream Bilinear Model (MSBM) to address fine-grained action recognition problem. Our model leverages both coarse-grained context information and fine-grained cross-modality (CM) interaction to summarize the whole video sequence and capture the relationship between different modalities simultaneously. We demonstrate the influence of modeling cross-modality interaction with informative CM channel selection for significantly improving the accuracy and reducing the confusion rate between easily-confused classes. Evaluating our approach on HMDB51 and the subset of Kinetics, we show that our MSBM performs favorably against the state-of-the-art architectures, enabling a richer understanding of fine-grained action recognition in video. | en |
dc.description.provenance | Made available in DSpace on 2021-06-17T08:21:48Z (GMT). No. of bitstreams: 1 ntu-108-R05922174-1.pdf: 1946468 bytes, checksum: d030d58ae5f9e63941dad9700c026ee0 (MD5) Previous issue date: 2019 | en |
dc.description.tableofcontents | Oral Examination Committee Certification iii
Acknowledgements v
Abstract (Chinese) vii
Abstract ix
1 Introduction 1
2 Related Work 5
2.1 Action Recognition 5
2.2 Fine-Grained Image Classification 6
3 Network Architecture 9
3.1 Coarse-Grained Context Information 9
3.2 Fine-Grained Cross-Modality Interaction 11
3.3 Consensus 14
4 Experiments 15
4.1 Datasets 15
4.2 Implementation Details 16
4.3 Main Results 17
5 Conclusion 23
Bibliography 25 | |
dc.language.iso | en | |
dc.title | Applying a Multi-Stream Bilinear Convolutional Model to Fine-Grained Action Recognition | zh_TW |
dc.title | MSBM: Multi-Stream Bilinear Model for Fine-grained Action Recognition | en |
dc.type | Thesis | |
dc.date.schoolyear | 107-2 | |
dc.description.degree | Master | |
dc.contributor.oralexamcommittee | 陳文進,葉梅珍,余能豪 | |
dc.subject.keyword | action recognition, convolutional neural network, multi-stream bilinear convolution, cross-modality interaction | zh_TW |
dc.subject.keyword | action recognition, convolutional neural network, multi-stream bilinear convolution, cross-modality interaction | en |
dc.relation.page | 27 | |
dc.identifier.doi | 10.6342/NTU201902156 | |
dc.rights.note | Authorized with compensation (paid access) | |
dc.date.accepted | 2019-08-14 | |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | zh_TW |
dc.contributor.author-dept | Graduate Institute of Computer Science and Information Engineering | zh_TW |
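The abstract's "fine-grained cross-modality interaction" fuses features from different modality streams (e.g. RGB appearance and optical flow), in the spirit of the bilinear pooling of reference [14]. The thesis's actual MSBM implementation is not included in this record; the following is only a minimal illustrative sketch of bilinear fusion between two hypothetical modality feature vectors, with the signed square-root and L2 normalization commonly applied to bilinear features.

```python
import numpy as np

def bilinear_pool(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Fuse two modality feature vectors via their outer product.

    The outer product captures all pairwise channel interactions
    between the two modalities; the result is flattened, passed
    through a signed square-root, and L2-normalized.
    """
    b = np.outer(x, y).reshape(-1)          # (dim_x * dim_y,) interaction vector
    b = np.sign(b) * np.sqrt(np.abs(b))     # signed square-root normalization
    norm = np.linalg.norm(b)
    return b / norm if norm > 0 else b      # L2 normalization

# Hypothetical per-frame features from two modality streams.
rgb_feat = np.random.rand(8)    # e.g. appearance-stream channels
flow_feat = np.random.rand(8)   # e.g. motion-stream channels
fused = bilinear_pool(rgb_feat, flow_feat)
print(fused.shape)              # (64,)
```

In practice such fused descriptors would be pooled over frames and fed to a classifier; the feature dimensions here are toy-sized for readability.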
Appears in Collections: | Department of Computer Science and Information Engineering |
Files in this item:
File | Size | Format | |
---|---|---|---|
ntu-108-1.pdf (currently not authorized for public access) | 1.9 MB | Adobe PDF |
Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.