DSpace

The institutional repository system DSpace is dedicated to preserving digital materials of all kinds (e.g. text, images, PDF) and making them easy to access.

NTU Theses and Dissertations Repository › 電機資訊學院 › 電機工程學系
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/54961
Full metadata record (DC field: value, language)
dc.contributor.advisor: 傅立成 (Li-Chen Fu)
dc.contributor.author: Jui-Ting Shen (en)
dc.contributor.author: 沈睿庭 (zh_TW)
dc.date.accessioned: 2021-06-16T03:42:46Z
dc.date.available: 2024-02-05
dc.date.copyright: 2021-02-22
dc.date.issued: 2021
dc.date.submitted: 2021-02-05
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/54961
dc.description.abstract: Video content analysis has recently attracted wide attention from both industry and academia, and one of its major branches is action detection. Most existing work treats action detection as an offline problem. In real-world applications such as autonomous driving, assistive robots, and surveillance systems, however, actions must be detected the moment they happen, so an online setting is more practical. Online action detection aims to identify ongoing actions as soon as each frame of a streaming video arrives.
An input video sequence contains not only frames of the action of interest but also background (non-action) and other irrelevant frames, which push the network toward learning less discriminative features. This thesis proposes an Enhanced Relation Layer, embedded in a Temporal Convolution Network (TCN), which updates the features according to their relevance to the action of interest and their actioness. Relevant features should be treated as essential, and irrelevant ones suppressed. The Enhanced Relation Layer assigns each timestep a relevance score, reflecting its relation to the action of interest, and an actioness score, reflecting the probability that an action is occurring. These two scores guide the network to focus on the essential features and to learn a more discriminative representation of the action happening at the current timestep.
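The two-score reweighting described above can be illustrated with a minimal NumPy sketch. This is not the thesis's implementation: here the relevance score is approximated as a softmax similarity to the most recent frame, and the actioness score as a sigmoid over a hypothetical linear head (`w_action`, `b_action`); in the thesis both scores come from learned modules (Chapter 3).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def enhanced_relation(features, w_action, b_action=0.0):
    """Reweight a (T, D) feature sequence by relevance and actioness.

    relevance: softmax-normalized similarity of every timestep to the
    current (most recent) frame's feature.
    actioness: per-timestep probability that any action is occurring,
    here a sigmoid over an illustrative linear head (w_action, b_action).
    """
    current = features[-1]                               # newest frame
    relevance = softmax(features @ current)              # (T,), sums to 1
    actioness = sigmoid(features @ w_action + b_action)  # (T,), in (0, 1)
    gate = (relevance * actioness)[:, None]              # combine both scores
    return features * gate                               # damp irrelevant steps

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))        # 8 timesteps, 16-dim features
out = enhanced_relation(feats, rng.normal(size=16))
print(out.shape)                        # (8, 16)
```

Because both scores lie in (0, 1), the gate only attenuates: timesteps judged irrelevant or action-free contribute less to the representation of the current frame.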
The temporal structure of the input sequence is learned by the TCN, whose layers output features with different receptive fields covering different temporal scales. Lower-level features, however, are semantically weak. We therefore design a Temporal Pyramid Network with a top-down architecture that propagates the strong semantics of higher levels down to lower levels, building semantically strong feature sequences at every level; actions of different temporal lengths can then be recognized with multi-temporal-scale features. In the experiments, we apply the proposed method to two benchmark datasets, THUMOS-14 and TVSeries. It outperforms several baseline networks and achieves promising results compared with state-of-the-art methods. (en)
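The top-down fusion just described follows the feature-pyramid pattern. A schematic NumPy sketch, with identity matrices standing in for the learned lateral projections and nearest-neighbour repetition standing in for upsampling (all names and shapes are illustrative, not the thesis's code):

```python
import numpy as np

def upsample2(x):
    """Nearest-neighbour upsampling along time: (T, D) -> (2T, D)."""
    return np.repeat(x, 2, axis=0)

def temporal_pyramid(levels, lateral_ws):
    """FPN-style top-down pass over temporal feature levels.

    levels: bottom-up TCN outputs ordered low -> high level; each level
    halves the temporal length of the one below it.
    lateral_ws: one (D, D) projection per level, stand-ins for learned
    1x1 convolutions.
    Returns semantically strengthened features at every level.
    """
    top = levels[-1] @ lateral_ws[-1]    # coarsest, semantically strongest
    outputs = [top]
    for feat, w in zip(levels[-2::-1], lateral_ws[-2::-1]):
        # upsample the higher level and merge with the lateral projection
        top = upsample2(top) + feat @ w
        outputs.append(top)
    return outputs[::-1]                 # re-order low -> high level

rng = np.random.default_rng(1)
D = 8
pyramid = [rng.normal(size=(16, D)), rng.normal(size=(8, D)), rng.normal(size=(4, D))]
outs = temporal_pyramid(pyramid, [np.eye(D)] * 3)
print([o.shape for o in outs])           # [(16, 8), (8, 8), (4, 8)]
```

Each output level keeps its own temporal resolution while inheriting the coarse level's semantics, which is what lets a single detector head handle both short and long actions.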
dc.description.provenance: Made available in DSpace on 2021-06-16T03:42:46Z (GMT). No. of bitstreams: 1. U0001-0402202123141800.pdf: 35873229 bytes, checksum: 0ef6f658d1ada8d8da4625451e3e04fe (MD5). Previous issue date: 2021 (en)
dc.description.tableofcontents: 口試委員會審定書 (Thesis Certification by the Oral Examination Committee) #
誌謝 (Acknowledgements) I
摘要 (Chinese Abstract) II
ABSTRACT IV
TABLE OF CONTENTS VI
LIST OF FIGURES IX
LIST OF TABLES XIV
Chapter 1 Introduction 1
1.1 Motivation 1
1.2 Literature Review 3
1.2.1 Action Recognition and Sequential Modeling 4
1.2.2 Action Detection with Deep Neural Network 5
1.2.3 Online Action Detection 9
1.3 Contribution 11
1.4 Thesis Organization 12
Chapter 2 Preliminaries 14
2.1 Convolutional Neural Network 14
2.1.1 Convolutional Layers 16
2.1.2 Residual Network 17
2.1.3 Two-stream Convolutional Network 19
2.1.4 3D Convolutional Network 20
2.1.5 Receptive Field 22
2.2 Temporal Convolution Network 24
2.3 Feature Pyramid Network 26
2.4 Self-Attention Mechanism 28
Chapter 3 Methodology 30
3.1 Problem Setting 30
3.2 Video Feature Extraction 32
3.3 Enhanced Relation Layer 33
3.3.1 Relevance Module 35
3.3.2 Enhanced Relevance Module 39
3.3.3 Actioness Module 41
3.3.4 Combination with Convolutional Network 43
3.4 Temporal Pyramid Network 46
3.4.1 Bottom-up Pathway 48
3.4.2 Top-down Pathway 49
3.4.3 Multi-layer Output 50
Chapter 4 Experiments 53
4.1 Configuration 53
4.2 Action Detection Datasets and Evaluation Metrics 54
4.2.1 THUMOS-14 54
4.2.2 TVSeries 55
4.2.3 Evaluation Metrics 57
4.3 Implementation Details 60
4.4 Ablation Studies 62
4.4.1 The Results of Enhanced Relation Layer 62
4.4.2 The Results of Temporal Pyramid Network 63
4.4.3 Effectiveness of the Proposed Method 64
4.5 Online Action Detection Results 66
4.5.1 Results of THUMOS-14 Dataset 66
4.5.2 Results of TVSeries Dataset 67
Chapter 5 Conclusion and Future Works 69
REFERENCE 70
dc.language.iso: en
dc.subject: 特徵金字塔 (zh_TW)
dc.subject: 深度學習 (zh_TW)
dc.subject: 影片內容分析 (zh_TW)
dc.subject: 線上動作檢測 (zh_TW)
dc.subject: 時序模型 (zh_TW)
dc.subject: deep learning (en)
dc.subject: feature pyramid (en)
dc.subject: temporal modeling (en)
dc.subject: online action detection (en)
dc.subject: video content analysis (en)
dc.title: 具有增強關聯機制用於線上動作檢測之時序金字塔網路 (zh_TW)
dc.title: Temporal Pyramid Networks with Enhanced Relation Mechanism for Online Action Detection (en)
dc.type: Thesis
dc.date.schoolyear: 109-1
dc.description.degree: Master (碩士)
dc.contributor.oralexamcommittee: 傅楸善 (Chiou-Shann Fuh), 莊永裕 (Yung-Yu Chuang), 方瓊瑤 (Chiung-Yao Fang), 陳永耀 (Yung-Yaw Chen)
dc.subject.keyword: 深度學習, 影片內容分析, 線上動作檢測, 時序模型, 特徵金字塔 (zh_TW)
dc.subject.keyword: deep learning, video content analysis, online action detection, temporal modeling, feature pyramid (en)
dc.relation.page: 73
dc.identifier.doi: 10.6342/NTU202100548
dc.rights.note: 有償授權 (paid authorization)
dc.date.accepted: 2021-02-08
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) (zh_TW)
dc.contributor.author-dept: 電機工程學研究所 (Graduate Institute of Electrical Engineering) (zh_TW)
Appears in collections: 電機工程學系 (Department of Electrical Engineering)

Files in this item:
File: U0001-0402202123141800.pdf | Size: 35.03 MB | Format: Adobe PDF | Access: restricted (not authorized for public access)


Unless their copyright terms state otherwise, all items in this system are protected by copyright, with all rights reserved.
