Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74146
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 徐宏民(Winston H. Hsu) | |
dc.contributor.author | Hu-Cheng Lee | en |
dc.contributor.author | 李胡丞 | zh_TW |
dc.date.accessioned | 2021-06-17T08:21:48Z | - |
dc.date.available | 2022-08-21 | |
dc.date.copyright | 2019-08-21 | |
dc.date.issued | 2019 | |
dc.date.submitted | 2019-08-13 | |
dc.identifier.citation | [1] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan. YouTube-8M: A large-scale video classification benchmark. arXiv, 2016.
[2] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In CVPR, 2015.
[3] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, 2017.
[4] J. Carreira et al. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
[5] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
[6] C. Feichtenhofer, A. Pinz, and R. Wildes. Spatiotemporal residual networks for video action recognition. In NIPS, 2016.
[7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[8] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In CVPR, 2018.
[9] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In CVPR, 2010.
[10] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[11] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The Kinetics human action video dataset. arXiv, 2017.
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[13] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In ICCV, 2011.
[14] T.-Y. Lin, A. RoyChowdhury, and S. Maji. Bilinear CNN models for fine-grained visual recognition. In ICCV, 2015.
[15] D. G. Lowe. Object recognition from local scale-invariant features. In ICCV, 1999.
[16] C.-Y. Ma, A. Kadav, I. Melvin, Z. Kira, G. AlRegib, and H. Peter Graf. Attend and interact: Higher-order object interactions for video understanding. In CVPR, 2018.
[17] M. Monfort, B. Zhou, S. A. Bargal, A. Andonian, T. Yan, K. Ramakrishnan, L. Brown, Q. Fan, D. Gutfreund, C. Vondrick, et al. Moments in Time dataset: One million videos for event understanding. arXiv, 2018.
[18] F. Perronnin, J. Sánchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In ECCV, 2010.
[19] M. Rohrbach et al. A database for fine-grained activity detection of cooking activities. In CVPR, 2012.
[20] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
[21] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[22] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human action classes from videos in the wild. arXiv, 2012.
[23] L. Sun et al. Human action recognition using factorized spatio-temporal convolutional networks. In ICCV, 2015.
[24] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
[25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
[26] L. Wang et al. Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, 2015.
[27] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
[28] X. Wang et al. Actions ~ transformations. In CVPR, 2016.
[29] H. Zhang, T. Xu, M. Elhoseiny, X. Huang, S. Zhang, A. Elgammal, and D. Metaxas. SPDA-CNN: Unifying semantic part detection and abstraction for fine-grained recognition. In CVPR, 2016.
[30] N. Zhang, E. Shelhamer, Y. Gao, and T. Darrell. Fine-grained pose prediction, normalization, and recognition. arXiv, 2015. | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74146 | - |
dc.description.abstract | Convolutional neural networks have recently made great progress in video action recognition. However, prior work focuses only on a video's coarse-grained appearance, ignoring the similarity between different actions and the diversity within the same action. From the failure cases of current action recognition methods, we observe that misclassified actions are easily confused with one another because of their visual similarity. We therefore target this specific problem: fine-grained action recognition. It poses two challenges: high variation within an action class and high similarity between different action classes. In this work, we propose a multi-stream bilinear convolutional model to address fine-grained action recognition. Our model uses the video's overall appearance to extract salient features from the whole video, and fine-grained cross-modality interaction to capture the relationships between modalities. Applying the model to HMDB51 and a subset of Kinetics, we improve fine-grained recognition accuracy and reduce the confusion rate, demonstrating that our model achieves a more complete understanding of fine-grained action recognition. | zh_TW |
dc.description.abstract | Recent studies have demonstrated the effectiveness of convolutional neural networks for action recognition. However, previous works only focus on coarse-grained appearance and ignore the similarity between action classes and diversity within the same action class. Our motivation stems from the observation that some failure cases of existing approaches are easily confused with each other. Towards this end, we target at a promising direction -- Fine-Grained Action Recognition. The challenges of fine-grained action recognition are high intra-class variation and low inter-class variation. In this paper, we propose Multi-Stream Bilinear Model (MSBM) to address fine-grained action recognition problem. Our model leverages both coarse-grained context information and fine-grained cross-modality (CM) interaction to summarize the whole video sequence and capture the relationship between different modalities simultaneously. We demonstrate the influence of modeling cross-modality interaction with informative CM channel selection for significantly improving the accuracy and reducing the confusion rate between easily-confused classes. Evaluating our approach on HMDB51 and the subset of Kinetics, we show that our MSBM performs favorably against the state-of-the-art architectures, enabling a richer understanding of fine-grained action recognition in video. | en |
dc.description.provenance | Made available in DSpace on 2021-06-17T08:21:48Z (GMT). No. of bitstreams: 1 ntu-108-R05922174-1.pdf: 1946468 bytes, checksum: d030d58ae5f9e63941dad9700c026ee0 (MD5) Previous issue date: 2019 | en |
dc.description.tableofcontents | Oral Examination Committee Certification iii
Acknowledgements v
Abstract (Chinese) vii
Abstract ix
1 Introduction 1
2 Related Work 5
2.1 Action Recognition 5
2.2 Fine-Grained Image Classification 6
3 Network Architecture 9
3.1 Coarse-Grained Context Information 9
3.2 Fine-Grained Cross-Modality Interaction 11
3.3 Consensus 14
4 Experiments 15
4.1 Datasets 15
4.2 Implementation Details 16
4.3 Main Results 17
5 Conclusion 23
Bibliography 25 | |
dc.language.iso | en | |
dc.title | Applying a Multi-Stream Bilinear Convolutional Model to Fine-Grained Action Recognition | zh_TW |
dc.title | MSBM: Multi-Stream Bilinear Model for Fine-grained Action Recognition | en |
dc.type | Thesis | |
dc.date.schoolyear | 107-2 | |
dc.description.degree | Master | |
dc.contributor.oralexamcommittee | 陳文進,葉梅珍,余能豪 | |
dc.subject.keyword | action recognition, convolutional neural network, multi-stream bilinear convolution, cross-modality interaction | zh_TW |
dc.subject.keyword | action recognition, convolutional neural network, multi-stream bilinear convolution, cross-modality interaction | en |
dc.relation.page | 27 | |
dc.identifier.doi | 10.6342/NTU201902156 | |
dc.rights.note | Authorized with compensation (paid access) | |
dc.date.accepted | 2019-08-14 | |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | zh_TW |
dc.contributor.author-dept | Graduate Institute of Computer Science and Information Engineering | zh_TW |
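The abstract's "fine-grained cross-modality interaction" fuses features from different modality streams (e.g. RGB appearance and optical flow), in the spirit of the bilinear pooling of reference [14]. The thesis's actual MSBM implementation is not included in this record; the following is only a minimal illustrative sketch of bilinear fusion between two hypothetical modality feature vectors, with the signed square-root and L2 normalization commonly applied to bilinear features.

```python
import numpy as np

def bilinear_pool(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Fuse two modality feature vectors via their outer product.

    The outer product captures all pairwise channel interactions
    between the two modalities; the result is flattened, passed
    through a signed square-root, and L2-normalized.
    """
    b = np.outer(x, y).reshape(-1)          # (dim_x * dim_y,) interaction vector
    b = np.sign(b) * np.sqrt(np.abs(b))     # signed square-root normalization
    norm = np.linalg.norm(b)
    return b / norm if norm > 0 else b      # L2 normalization

# Hypothetical per-frame features from two modality streams.
rgb_feat = np.random.rand(8)    # e.g. appearance-stream channels
flow_feat = np.random.rand(8)   # e.g. motion-stream channels
fused = bilinear_pool(rgb_feat, flow_feat)
print(fused.shape)              # (64,)
```

In practice such fused descriptors would be pooled over frames and fed to a classifier; the feature dimensions here are toy-sized for readability.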
Appears in Collections: | Department of Computer Science and Information Engineering |
Files in this item:
File | Size | Format | |
---|---|---|---|
ntu-108-1.pdf (currently not authorized for public access) | 1.9 MB | Adobe PDF |
Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.