Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74455
Full metadata record
DC field / Value / Language
dc.contributor.advisor: 鄭士康 (Shyh-Kang Jeng)
dc.contributor.author: Yung-Han Huang [en]
dc.contributor.author: 黃永翰 [zh_TW]
dc.date.accessioned: 2021-06-17T08:36:45Z
dc.date.available: 2020-08-19
dc.date.copyright: 2019-08-19
dc.date.issued: 2019
dc.date.submitted: 2019-08-08
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74455
dc.description.abstract: 影片重定位的目標是根據我們所給予的查詢影片,在參考影片中找出我們所感興趣的片段(本研究探討的是動作片段)。本論文提出多尺度注意力機制模型來處理弱監督學習下的影片重定位問題。在訓練模型的過程中,不需要使用參考影片中動作片段的位置資訊。整體模型由三個模組所組成:第一部份使用預訓練的C3D模型提取特徵;接著設計一個多尺度注意力模組,根據不同尺度的特徵來計算參考影片與查詢影片的相似度,這樣的機制可保留更多時間資訊。最後,定位模組對參考影片中每個時間點的特徵進行預測,判斷其是否屬於動作片段。本研究使用共注意力損失函數來訓練模型,利用特徵之間的相似度,將動作片段從參考影片中分離出來。此外,只要加上交叉熵損失函數,我們的模型可以輕易修改成監督式學習的版本。我們在一個公開的資料集上測試,無論是監督式學習或是弱監督式學習,都獲得目前最好的效果。 [zh_TW]
dc.description.abstract: Video re-localization aims to localize a segment of interest, an action segment in this work, in a reference video according to a given query video. In this thesis, we propose an attention-based model that handles this problem under a weakly supervised setting; in other words, we train the model without labels of the action segment locations in the reference videos. Our model is composed of three major modules. In the first module, we employ a pre-trained C3D network as a feature extractor. In the second module, we design an attention mechanism that computes the similarity between the query video and the reference video based on multi-scale features, which better preserves local temporal information. In the final module, a localization layer predicts whether the feature at each time step of the reference video belongs to the queried action segment. The model is trained with a co-attention loss, which separates an action instance from the reference video by exploiting the similarity between features. Our model can also easily be adapted to the fully supervised setting by adding a cross-entropy loss that exploits the given location information. We evaluate our model on a public dataset and achieve state-of-the-art performance under both weakly supervised and fully supervised settings. [en] (An illustrative sketch of this pipeline follows the metadata record below.)
dc.description.provenance: Made available in DSpace on 2021-06-17T08:36:45Z (GMT). No. of bitstreams: 1
ntu-108-R06946015-1.pdf: 3017800 bytes, checksum: a6b3d33518fbf718572816a2a6148b09 (MD5)
Previous issue date: 2019 [en]
dc.description.tableofcontents:
Acknowledgments
Abstract
List of Figures
List of Tables
Chapter 1 Introduction
 1.1 Motivation and objectives
 1.2 Contributions
 1.3 Chapter outline
Chapter 2 Related Work
 2.1 Action recognition
 2.2 Action localization
 2.3 Action localization with sentence/video
Chapter 3 Preliminaries
 3.1 Multi-head attention
 3.2 Layer normalization
 3.3 1D convolutional neural network
Chapter 4 Approach
 4.1 Problem definition
 4.2 Attention modules
  4.2.1 Self-attention module
  4.2.2 Multi-scale attention module
 4.3 Model
  4.3.1 Feature extraction
  4.3.2 Attention modules
  4.3.3 Localization module
 4.4 Loss functions
  4.4.1 Co-attention loss
  4.4.2 Weighted cross-entropy loss
 4.5 Inference
Chapter 5 Experiments
 5.1 Dataset and evaluation metric
  5.1.1 Dataset ActivityNetR
  5.1.2 Evaluation metric
 5.2 Implementation details
 5.3 Compared methods
 5.4 Comparison under weak supervision
 5.5 Comparison under full supervision
 5.6 Visualization
Chapter 6 Conclusions
Bibliography
Chapter A Program operations manual
 A.1 Training and evaluation
 A.2 Visualization
dc.language.iso: en
dc.title: 使用多尺度注意模型之弱監督學習式影片動作片段重定位 [zh_TW]
dc.title: Weakly Supervised Video Re-localization of Action Segments with Multiscale Attention Model [en]
dc.type: Thesis
dc.date.schoolyear: 107-2
dc.description.degree: 碩士 (Master's)
dc.contributor.coadvisor: 林彥宇 (Yen-Yu Lin)
dc.contributor.oralexamcommittee: 莊永裕 (Yung-Yu Chuang), 王鈺強 (Yu-Chiang Frank Wang), 邱維辰 (Wei-Chen Chiu)
dc.subject.keyword: 影片重定位, 深度學習, 注意力機制, 共注意損失函數 [zh_TW]
dc.subject.keyword: Video Re-localization, Deep Learning, Attention Mechanism, Co-attention Loss [en]
dc.relation.page: 42
dc.identifier.doi: 10.6342/NTU201902914
dc.rights.note: 有償授權 (authorized with fee)
dc.date.accepted: 2019-08-10
dc.contributor.author-college: 電機資訊學院 [zh_TW]
dc.contributor.author-dept: 資料科學學位學程 [zh_TW]
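
The abstract above outlines the pipeline at a high level: pre-trained C3D features for the query and reference videos, an attention step that matches reference time steps against the query, and a localization module that scores each reference time step. The block below is a minimal sketch of that flow in PyTorch, not the thesis implementation: the class name ReLocalizationSketch, the 4096-dimensional C3D feature size, the hidden width, the single-scale cross-attention, and the plain sigmoid head are assumptions made for illustration; the thesis uses a multi-scale attention module and trains with a co-attention loss (plus a weighted cross-entropy loss in the fully supervised setting), neither of which is reproduced here.

# Minimal illustrative sketch (not the thesis code) of the described pipeline:
# pre-extracted C3D features for the query and reference videos, cross-attention
# from reference time steps to the query, and a per-time-step localization head.
import torch
import torch.nn as nn

class ReLocalizationSketch(nn.Module):
    def __init__(self, feat_dim=4096, hidden_dim=512, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden_dim)   # project C3D features
        self.norm = nn.LayerNorm(hidden_dim)          # layer normalization (cf. Section 3.2)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.temporal_conv = nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1)
        self.classifier = nn.Linear(hidden_dim, 1)    # per-time-step action score

    def forward(self, ref_feats, query_feats):
        # ref_feats: (B, T_ref, feat_dim); query_feats: (B, T_query, feat_dim)
        ref = self.norm(self.proj(ref_feats))
        qry = self.norm(self.proj(query_feats))
        # each reference time step attends to the query video
        attended, _ = self.cross_attn(ref, qry, qry)
        # 1D convolution over time for local temporal context (cf. Section 3.3)
        attended = self.temporal_conv(attended.transpose(1, 2)).transpose(1, 2)
        # probability that each reference time step belongs to the queried action
        return torch.sigmoid(self.classifier(attended)).squeeze(-1)

# Usage with random tensors standing in for C3D features.
model = ReLocalizationSketch()
scores = model(torch.randn(2, 128, 4096), torch.randn(2, 32, 4096))
print(scores.shape)  # torch.Size([2, 128])

Grouping consecutive reference time steps whose score exceeds a threshold would then yield a candidate segment; the actual inference procedure is described in Section 4.5 of the thesis.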
Appears in Collections: 資料科學學位學程

Files in This Item:
File | Size | Format
ntu-108-1.pdf (currently not authorized for public access) | 2.95 MB | Adobe PDF


Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
