Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90602

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 傅立成 | zh_TW |
| dc.contributor.advisor | Li-Chen Fu | en |
| dc.contributor.author | 宋體淮 | zh_TW |
| dc.contributor.author | Ti-Huai Song | en |
| dc.date.accessioned | 2023-10-03T16:49:00Z | - |
| dc.date.available | 2023-11-09 | - |
| dc.date.copyright | 2023-10-03 | - |
| dc.date.issued | 2023 | - |
| dc.date.submitted | 2023-05-12 | - |
| dc.identifier.citation | K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," CoRR, vol. abs/1409.1556, 2015. K. Simonyan and A. Zisserman, "Two-Stream Convolutional Networks for Action Recognition in Videos," in Advances in Neural Information Processing Systems (NIPS), 2014. W. Kay et al., "The Kinetics Human Action Video Dataset," arXiv, vol. abs/1705.06950, 2017. T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, "High Accuracy Optical Flow Estimation Based on a Theory for Warping," in European Conference on Computer Vision (ECCV), 2004. D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "C3D: Generic Features for Video Analysis," arXiv, vol. abs/1412.0767, 2014. C. Feichtenhofer, H. Fan, J. Malik, and K. He, "SlowFast Networks for Video Recognition," in IEEE International Conference on Computer Vision (ICCV), 2019. X. Peng and C. Schmid, "Multi-Region Two-Stream R-CNN for Action Detection," in European Conference on Computer Vision (ECCV), 2016. S. Ren et al., "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, 2017. N. Murray, L. Marchesotti, and F. Perronnin, "AVA: A Large-Scale Database for Aesthetic Visual Analysis," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012. J. Carreira and A. Zisserman, "Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. G. Singh et al., "Online Real-Time Multiple Spatiotemporal Action Localisation and Prediction," in IEEE International Conference on Computer Vision (ICCV), 2017. W. Liu et al., "SSD: Single Shot MultiBox Detector," in European Conference on Computer Vision (ECCV), 2016. V. Kalogeiton et al., "Action Tubelet Detector for Spatio-Temporal Action Localization," in IEEE International Conference on Computer Vision (ICCV), 2017. O. Köpüklü, X. Wei, and G. Rigoll, "You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization," arXiv, vol. abs/1911.06644, 2019. J. Redmon et al., "You Only Look Once: Unified, Real-Time Object Detection," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. J. Zhao and C. G. M. Snoek, "Dance with Flow: Two-in-One Stream Action Detection," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. S. Sun et al., "Optical Flow Guided Feature: A Fast and Robust Motion Representation for Video Action Recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. Y. Zhu et al., "Hidden Two-Stream Convolutional Networks for Action Recognition," in Asian Conference on Computer Vision (ACCV), 2018. J. Stroud et al., "D3D: Distilled 3D Networks for Video Action Recognition," in IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2020. G. Hinton, O. Vinyals, and J. Dean, "Distilling the Knowledge in a Neural Network," arXiv, vol. abs/1503.02531, 2015. D. Lopez-Paz et al., "Unifying Distillation and Privileged Information," arXiv, vol. abs/1511.03643, 2015. V. Vapnik and R. Izmailov, "Learning Using Privileged Information: Similarity Control and Knowledge Transfer," Journal of Machine Learning Research, vol. 16, no. 1, pp. 2023-2049, 2015. N. C. Garcia, P. Morerio, and V. Murino, "Modality Distillation with Multiple Stream Networks for Action Recognition," in European Conference on Computer Vision (ECCV), 2018. N. Crasto et al., "MARS: Motion-Augmented RGB Stream for Action Recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. R. M. French, "Catastrophic Forgetting in Connectionist Networks," Trends in Cognitive Sciences, vol. 3, no. 4, pp. 128-135, 1999. C. Sun et al., "Actor-Centric Relation Network," in European Conference on Computer Vision (ECCV), 2018. J. Tang et al., "Asynchronous Interaction Aggregation for Action Detection," in European Conference on Computer Vision (ECCV), 2020. C. Wu et al., "Long-Term Feature Banks for Detailed Video Understanding," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. J. Pan et al., "Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021. K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. A. Romero et al., "FitNets: Hints for Thin Deep Nets," arXiv, vol. abs/1412.6550, 2014. S. Zagoruyko and N. Komodakis, "Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer," arXiv, vol. abs/1612.03928, 2016. D. Tran et al., "Learning Spatiotemporal Features with 3D Convolutional Networks," in IEEE International Conference on Computer Vision (ICCV), 2015. J. Hu, L. Shen, and G. Sun, "Squeeze-and-Excitation Networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. Z. Li et al., "AARM: Action Attention Recalibration Module for Action Recognition," in Asian Conference on Machine Learning (ACML), PMLR, 2020. S. Woo et al., "CBAM: Convolutional Block Attention Module," in European Conference on Computer Vision (ECCV), 2018. M. B. Muhammad and M. Yeasin, "Eigen-CAM: Class Activation Map Using Principal Components," in International Joint Conference on Neural Networks (IJCNN), 2020. X. Wang et al., "Non-Local Neural Networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. H. Jhuang et al., "Towards Understanding Action Recognition," in IEEE International Conference on Computer Vision (ICCV), 2013. K. Soomro, A. R. Zamir, and M. Shah, "UCF101: A Dataset of 101 Human Action Classes from Videos in the Wild," arXiv, vol. abs/1212.0402, 2012. M. Everingham et al., "The PASCAL Visual Object Classes (VOC) Challenge," International Journal of Computer Vision, vol. 88, 2010. S. Xie et al., "Aggregated Residual Transformations for Deep Neural Networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. Y. Li et al., "Actions as Moving Points," in European Conference on Computer Vision (ECCV), 2020. J. Wu et al., "Context-Aware RCNN: A Baseline for Action Detection in Videos," in European Conference on Computer Vision (ECCV), 2020. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90602 | - |
| dc.description.abstract | 隨著深度學習與視訊影像內容分析的迅速發展,線上即時之時空間動作偵測因其在真實場景應用中的優勢而受到業界和學術界的廣泛關注。該任務有兩個關鍵層面對於更好地理解人類動作而言至關重要,分別為運動建模和長期關聯建模。因此,我們提出了一套運動增強策略和記憶增強策略,以改善網路在運動方面和長期資訊方面的表示能力,從而實現線上即時之時空間動作偵測。近年來有許多研究利用了特徵蒸餾的技術,使得網路可以學習到運動知識並且在推理期間又可無需進行光流計算和雙流架構。然而,由於直接對特徵進行覆寫的關係,特徵蒸餾可能會導致網路的空間外觀訊息的損失,並且抑制網路對於關鍵運動知識的學習。因此,我們提出了一個運動增強策略稱作是注意力引導運動蒸餾,該技術能讓網路在保留其空間外觀知識的同時,又可在注意力機制的引導下選擇性地學習關鍵的運動訊息。現今在一些研究中所使用的長期關聯建模方法並不適合應用在需要線上即時的場景中。因此,我們提出了一個記憶增強策略稱作是線上特徵記憶,它能夠以在線的方式向網路提供長期資訊進而增強網路的偵測能力,且能達成更有效率的推理。此外,為能有效地融合長期資訊和當前資訊,我們提出了一個時序增強的記憶運算子,該運算子解決了傳統交叉注意機制在計算長期和當前資訊間的關係時的時序建模上的不足。在實驗部分,我們展示了我們的方法在兩個公開數據集J-HMDB-21和UCF101-24上的有效性。並且我們的方法與其他相關研究相比也取得了優異的表現。 | zh_TW |
| dc.description.abstract | With the rapid growth of deep learning and video content analysis, online real-time spatio-temporal action detection has attracted wide attention from industry and academia owing to its suitability for real-world applications. Two aspects of this task are crucial for better understanding human actions: motion modeling and long-term dependency modeling. We therefore propose a motion-augmented strategy and a memory-augmented strategy that improve the network's representation ability in terms of motion and long-term reasoning for online real-time spatio-temporal action detection. Recent works have explored feature distillation to let the network acquire motion knowledge without optical-flow computation or a two-stream design during inference. However, because it directly overwrites feature maps, feature distillation may destroy the network's appearance information and suppress its learning of crucial motion knowledge. Our motion-augmented strategy, called Attention-Guided Motion Distillation, therefore lets the network retain its appearance knowledge while selectively learning crucial motion information under the guidance of an attention mechanism. Meanwhile, the long-term dependency modeling approaches used in recent works are unsuitable for online, real-time applications. Our memory-augmented strategy, called Online Feature Memory, enhances detection by providing long-term information to the network in an online manner while also allowing efficient inference. In addition, to integrate the long-term and the current information effectively, we propose the Temporal-Enhanced Memory Operator, which addresses the limitation of the conventional cross-attention mechanism in temporal modeling when computing relations between long-term and current information. In the experiments, we show the effectiveness of our method on two benchmark datasets, J-HMDB-21 and UCF101-24, where it achieves superior performance compared with other related works (illustrative sketches of the two proposed strategies appear after the metadata table below). | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-10-03T16:49:00Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2023-10-03T16:49:00Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Acknowledgements i Chinese Abstract ii ABSTRACT iii CONTENTS v LIST OF FIGURES viii LIST OF TABLES xii Chapter 1 Introduction 1 1.1 Background 1 1.2 Motivation 2 1.3 Literature Review 6 1.3.1 Action Recognition 7 1.3.2 Spatio-Temporal Action Detection 8 1.3.3 Motion Modeling 9 1.3.4 Generalized Distillation 11 1.3.5 Spatio-Temporal Dependency Modeling 12 1.4 Contribution 13 1.5 Thesis Organization 14 Chapter 2 Preliminaries 16 2.1 Deep Neural Networks 16 2.2 Convolutional Neural Networks 18 2.3 3D Convolutional Networks 19 2.4 Residual Networks 20 2.5 Knowledge Distillation 22 2.6 You Only Watch Once (YOWO) 24 Chapter 3 Methodology 26 3.1 Problem Formulation 26 3.2 System Overview 27 3.3 Motion-Augmented Strategy 29 3.3.1 Attention Generation 30 3.3.2 Attention-Guided Motion Distillation 33 3.4 Memory-Augmented Strategy 37 3.4.1 Memory Generation 38 3.4.2 Memory Operator 39 3.4.3 Temporal-Enhanced Block 42 3.4.4 Online Memory Inference 47 Chapter 4 Experimental Results 48 4.1 Computational Environment 48 4.2 Datasets and Evaluation Metrics 49 4.2.1 J-HMDB-21 49 4.2.2 UCF101-24 50 4.2.3 Evaluation Metrics 51 4.3 Implementation Details 51 4.4 Ablation Studies 52 4.4.1 The Result of Attention-Guided Motion Distillation 53 4.4.2 The Result of Online Feature Memory 55 4.5 Comparison with Related Works 56 Chapter 5 Conclusion and Future Works 59 REFERENCES 60 | - |
| dc.language.iso | en | - |
| dc.subject | 時空間動作偵測 | zh_TW |
| dc.subject | 視訊影像內容分析 | zh_TW |
| dc.subject | 深度學習 | zh_TW |
| dc.subject | 記憶 | zh_TW |
| dc.subject | 知識蒸餾 | zh_TW |
| dc.subject | spatio-temporal action detection | en |
| dc.subject | video content analysis | en |
| dc.subject | deep learning | en |
| dc.subject | memory | en |
| dc.subject | knowledge distillation | en |
| dc.title | 以運動與記憶增強之網路用於線上即時之時空間動作偵測 | zh_TW |
| dc.title | Motion & Memory-Augmented Network for Online Real-Time Spatio-Temporal Action Detection | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 111-2 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.oralexamcommittee | 王鈺強;范欽雄;黃正民;莊永裕 | zh_TW |
| dc.contributor.oralexamcommittee | Yu-Chiang Wang;Chin-Shyurng Fahn;Cheng-Ming Huang;Yung-Yu Chuang | en |
| dc.subject.keyword | 深度學習,視訊影像內容分析,時空間動作偵測,知識蒸餾,記憶 | zh_TW |
| dc.subject.keyword | deep learning,video content analysis,spatio-temporal action detection,knowledge distillation,memory | en |
| dc.relation.page | 64 | - |
| dc.identifier.doi | 10.6342/NTU202300791 | - |
| dc.rights.note | 未授權 | - |
| dc.date.accepted | 2023-05-12 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 電機工程學系 | - |
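
The two strategies summarized in the abstract lend themselves to short illustrations. The first is a minimal PyTorch sketch of attention-guided feature distillation: the mimicking loss between the RGB student's and the flow teacher's feature maps is weighted by a spatial attention map derived from the teacher, so motion cues are absorbed where they are salient while the unweighted term leaves appearance features largely intact. All names (`attention_map`, `motion_distillation_loss`, `alpha`) and the exact loss form are illustrative assumptions, not the thesis's actual formulation.

```python
import torch
import torch.nn.functional as F

def attention_map(feat: torch.Tensor) -> torch.Tensor:
    # feat: (N, C, H, W) flow-teacher feature map.
    # Channel-wise energy -> spatial softmax, rescaled so a uniform
    # attention map has value 1 at every position.
    energy = feat.pow(2).mean(dim=1, keepdim=True)            # (N, 1, H, W)
    n, _, h, w = energy.shape
    attn = F.softmax(energy.view(n, -1), dim=1).view(n, 1, h, w)
    return attn * (h * w)

def motion_distillation_loss(student_feat, teacher_feat, alpha=0.5):
    # Per-position squared error between student (RGB) and teacher (flow)
    # features, weighted by the teacher's attention; the plain term keeps a
    # uniform mimicking pressure, and alpha (assumed) balances the two.
    attn = attention_map(teacher_feat).detach()
    per_pos = (student_feat - teacher_feat.detach()).pow(2).mean(dim=1, keepdim=True)
    return alpha * (attn * per_pos).mean() + (1 - alpha) * per_pos.mean()
```

A second sketch, under the same caveats, shows the idea behind an online feature memory: past clip features are kept in a FIFO bank and fused with the current clip by cross-attention (current clip as query, memory as keys/values), so long-term context is available at each step without re-processing old frames. The thesis's Temporal-Enhanced Memory Operator augments this plain cross-attention with explicit temporal modeling, which is not reproduced here; `OnlineFeatureMemory`, `mem_len`, and the residual fusion are assumptions.

```python
from collections import deque

import torch
import torch.nn as nn

class OnlineFeatureMemory(nn.Module):
    # FIFO bank of detached past clip features, fused with the current
    # clip by cross-attention. Assumes a constant batch size across calls.
    def __init__(self, dim: int = 512, mem_len: int = 8, heads: int = 8):
        super().__init__()
        self.mem = deque(maxlen=mem_len)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cur: torch.Tensor) -> torch.Tensor:
        # cur: (N, T, dim) token features of the current clip.
        if self.mem:
            bank = torch.cat(list(self.mem), dim=1)   # (N, T * stored, dim)
            fused, _ = self.attn(cur, bank, bank)     # plain cross-attention
            out = cur + fused                         # residual fusion
        else:
            out = cur                                 # no history yet
        self.mem.append(cur.detach())                 # online update per clip
        return out
```

Calling such a module once per incoming clip keeps inference online: the features computed for the current clip are simply reused as memory for later clips, with no second pass over past frames.
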
| Appears in Collections: | 電機工程學系 |
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-111-2.pdf (not authorized for public access) | 3.24 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.