DSpace

The institutional repository system DSpace is dedicated to preserving digital materials of all kinds (e.g. text, images, PDF) and making them easy to access.

NTU Theses and Dissertations Repository › 電機資訊學院 › 電機工程學系
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/54961
Full metadata record (DC field: value, language)
dc.contributor.advisor: 傅立成 (Li-Chen Fu)
dc.contributor.author: Jui-Ting Shen (en)
dc.contributor.author: 沈睿庭 (zh_TW)
dc.date.accessioned: 2021-06-16T03:42:46Z
dc.date.available: 2024-02-05
dc.date.copyright: 2021-02-22
dc.date.issued: 2021
dc.date.submitted: 2021-02-05
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/54961
dc.description.abstract: Video content analysis has recently attracted wide attention from both industry and academia, and one of its major branches is action detection. Most existing work treats action detection as an offline problem. In real-world applications such as autonomous driving, assistive robots, and surveillance systems, however, actions must be detected the moment they happen, so an online setting is more practical. Online action detection aims to identify ongoing actions as soon as each frame of a streaming video arrives.
An input video sequence contains not only frames of the action of interest but also background (non-action) and other irrelevant frames, which push the network toward learning less discriminative features. This thesis proposes an Enhanced Relation Layer, embedded in a Temporal Convolution Network (TCN), which updates the features according to their relevance to the action of interest and their actioness. Relevant features should be treated as essential, and irrelevant ones suppressed. The Enhanced Relation Layer assigns each timestep a relevance score, reflecting its relation to the action of interest, and an actioness score, reflecting the probability that an action is occurring. These two scores guide the network to focus on the essential features and to learn a more discriminative representation of the action happening at the current timestep.
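The two-score reweighting described above can be illustrated with a minimal NumPy sketch. This is not the thesis's implementation: here the relevance score is approximated as a softmax similarity to the most recent frame, and the actioness score as a sigmoid over a hypothetical linear head (`w_action`, `b_action`); in the thesis both scores come from learned modules (Chapter 3).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def enhanced_relation(features, w_action, b_action=0.0):
    """Reweight a (T, D) feature sequence by relevance and actioness.

    relevance: softmax-normalized similarity of every timestep to the
    current (most recent) frame's feature.
    actioness: per-timestep probability that any action is occurring,
    here a sigmoid over an illustrative linear head (w_action, b_action).
    """
    current = features[-1]                               # newest frame
    relevance = softmax(features @ current)              # (T,), sums to 1
    actioness = sigmoid(features @ w_action + b_action)  # (T,), in (0, 1)
    gate = (relevance * actioness)[:, None]              # combine both scores
    return features * gate                               # damp irrelevant steps

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))        # 8 timesteps, 16-dim features
out = enhanced_relation(feats, rng.normal(size=16))
print(out.shape)                        # (8, 16)
```

Because both scores lie in (0, 1), the gate only attenuates: timesteps judged irrelevant or action-free contribute less to the representation of the current frame.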
The temporal structure of the input sequence is learned by the TCN, whose layers output features with different receptive fields covering different temporal scales. Lower-level features, however, are semantically weak. We therefore design a Temporal Pyramid Network with a top-down architecture that propagates the strong semantics of higher levels down to lower levels, building semantically strong feature sequences at every level; actions of different temporal lengths can then be recognized with multi-temporal-scale features. In the experiments, we apply the proposed method to two benchmark datasets, THUMOS-14 and TVSeries. It outperforms several baseline networks and achieves promising results compared with state-of-the-art methods. (en)
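The top-down fusion just described follows the feature-pyramid pattern. A schematic NumPy sketch, with identity matrices standing in for the learned lateral projections and nearest-neighbour repetition standing in for upsampling (all names and shapes are illustrative, not the thesis's code):

```python
import numpy as np

def upsample2(x):
    """Nearest-neighbour upsampling along time: (T, D) -> (2T, D)."""
    return np.repeat(x, 2, axis=0)

def temporal_pyramid(levels, lateral_ws):
    """FPN-style top-down pass over temporal feature levels.

    levels: bottom-up TCN outputs ordered low -> high level; each level
    halves the temporal length of the one below it.
    lateral_ws: one (D, D) projection per level, stand-ins for learned
    1x1 convolutions.
    Returns semantically strengthened features at every level.
    """
    top = levels[-1] @ lateral_ws[-1]    # coarsest, semantically strongest
    outputs = [top]
    for feat, w in zip(levels[-2::-1], lateral_ws[-2::-1]):
        # upsample the higher level and merge with the lateral projection
        top = upsample2(top) + feat @ w
        outputs.append(top)
    return outputs[::-1]                 # re-order low -> high level

rng = np.random.default_rng(1)
D = 8
pyramid = [rng.normal(size=(16, D)), rng.normal(size=(8, D)), rng.normal(size=(4, D))]
outs = temporal_pyramid(pyramid, [np.eye(D)] * 3)
print([o.shape for o in outs])           # [(16, 8), (8, 8), (4, 8)]
```

Each output level keeps its own temporal resolution while inheriting the coarse level's semantics, which is what lets a single detector head handle both short and long actions.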
dc.description.provenance: Made available in DSpace on 2021-06-16T03:42:46Z (GMT). No. of bitstreams: 1. U0001-0402202123141800.pdf: 35873229 bytes, checksum: 0ef6f658d1ada8d8da4625451e3e04fe (MD5). Previous issue date: 2021 (en)
dc.description.tableofcontents: 口試委員會審定書 (Thesis Certification by the Oral Examination Committee) #
誌謝 (Acknowledgements) I
摘要 (Chinese Abstract) II
ABSTRACT IV
TABLE OF CONTENTS VI
LIST OF FIGURES IX
LIST OF TABLES XIV
Chapter 1 Introduction 1
1.1 Motivation 1
1.2 Literature Review 3
1.2.1 Action Recognition and Sequential Modeling 4
1.2.2 Action Detection with Deep Neural Network 5
1.2.3 Online Action Detection 9
1.3 Contribution 11
1.4 Thesis Organization 12
Chapter 2 Preliminaries 14
2.1 Convolutional Neural Network 14
2.1.1 Convolutional Layers 16
2.1.2 Residual Network 17
2.1.3 Two-stream Convolutional Network 19
2.1.4 3D Convolutional Network 20
2.1.5 Receptive Field 22
2.2 Temporal Convolution Network 24
2.3 Feature Pyramid Network 26
2.4 Self-Attention Mechanism 28
Chapter 3 Methodology 30
3.1 Problem Setting 30
3.2 Video Feature Extraction 32
3.3 Enhanced Relation Layer 33
3.3.1 Relevance Module 35
3.3.2 Enhanced Relevance Module 39
3.3.3 Actioness Module 41
3.3.4 Combination with Convolutional Network 43
3.4 Temporal Pyramid Network 46
3.4.1 Bottom-up Pathway 48
3.4.2 Top-down Pathway 49
3.4.3 Multi-layer Output 50
Chapter 4 Experiments 53
4.1 Configuration 53
4.2 Action Detection Datasets and Evaluation Metrics 54
4.2.1 THUMOS-14 54
4.2.2 TVSeries 55
4.2.3 Evaluation Metrics 57
4.3 Implementation Details 60
4.4 Ablation Studies 62
4.4.1 The Results of Enhanced Relation Layer 62
4.4.2 The Results of Temporal Pyramid Network 63
4.4.3 Effectiveness of the Proposed Method 64
4.5 Online Action Detection Results 66
4.5.1 Results of THUMOS-14 Dataset 66
4.5.2 Results of TVSeries Dataset 67
Chapter 5 Conclusion and Future Works 69
REFERENCE 70
dc.language.iso: en
dc.subject: 特徵金字塔 (zh_TW)
dc.subject: 深度學習 (zh_TW)
dc.subject: 影片內容分析 (zh_TW)
dc.subject: 線上動作檢測 (zh_TW)
dc.subject: 時序模型 (zh_TW)
dc.subject: deep learning (en)
dc.subject: feature pyramid (en)
dc.subject: temporal modeling (en)
dc.subject: online action detection (en)
dc.subject: video content analysis (en)
dc.title: 具有增強關聯機制用於線上動作檢測之時序金字塔網路 (zh_TW)
dc.title: Temporal Pyramid Networks with Enhanced Relation Mechanism for Online Action Detection (en)
dc.type: Thesis
dc.date.schoolyear: 109-1
dc.description.degree: Master (碩士)
dc.contributor.oralexamcommittee: 傅楸善 (Chiou-Shann Fuh), 莊永裕 (Yung-Yu Chuang), 方瓊瑤 (Chiung-Yao Fang), 陳永耀 (Yung-Yaw Chen)
dc.subject.keyword: 深度學習, 影片內容分析, 線上動作檢測, 時序模型, 特徵金字塔 (zh_TW)
dc.subject.keyword: deep learning, video content analysis, online action detection, temporal modeling, feature pyramid (en)
dc.relation.page: 73
dc.identifier.doi: 10.6342/NTU202100548
dc.rights.note: 有償授權 (paid authorization)
dc.date.accepted: 2021-02-08
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) (zh_TW)
dc.contributor.author-dept: 電機工程學研究所 (Graduate Institute of Electrical Engineering) (zh_TW)
Appears in collections: 電機工程學系 (Department of Electrical Engineering)

Files in this item:
File: U0001-0402202123141800.pdf | Size: 35.03 MB | Format: Adobe PDF | Access: restricted (not authorized for public access)


Unless their copyright terms state otherwise, all items in this system are protected by copyright, with all rights reserved.
