NTU Theses and Dissertations Repository
Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88100
Full metadata record (DC field [language]: value):
dc.contributor.advisor [zh_TW]: 徐宏民
dc.contributor.advisor [en]: Winston Hsu
dc.contributor.author [zh_TW]: 蔡秉辰
dc.contributor.author [en]: Bing-Chen Tsai
dc.date.accessioned: 2023-08-08T16:18:02Z
dc.date.available: 2023-11-09
dc.date.copyright: 2023-08-08
dc.date.issued: 2023
dc.date.submitted: 2023-07-13
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88100
dc.description.abstract [zh_TW]: 影片問答是一項具有挑戰性的任務,需要模型在豐富的影片內容中準確地識別和理解相關信息。傳統方法試圖通過考慮視覺問題的關聯性,強調特定幀中的相關信息。然而,由於缺乏因果幀的真實標籤,這種關聯性只能透過「間接地」學習,導致了「聚焦錯誤」(misfocus)的問題。為了解決這個問題,我們提出了一種新的訓練框架,稱為「空間蒸餾與因果幀定位」(Spatial distillation And Reliable Causal frame localization),利用現成的圖像問答模型來幫助影片問答模型更好地理解影片的時間和空間維度中的相關信息。具體而言,我們利用圖像問答模型的視覺問題和答案先驗知識來獲得因果幀的偽標籤,並在時間維度上「直接地」指導影片聚焦在相關聯的幀。此外,由於圖像模型具有出色的空間推理能力,我們通過知識蒸餾將這種知識轉移到影片模型中。我們的方法不依賴於特定模型,在各種基準測試中均優於以前的方法。此外,它在多個影片問答模型(包括預訓練和非預訓練模型)上都能穩定提升性能。
dc.description.abstract [en]: Video Question Answering (Video QA) is a challenging task that requires models to accurately identify and contextualize relevant information within rich video content. Conventional approaches attempt to emphasize related information in specific frames by considering the visual-question relationship. However, because ground-truth causal frames are unavailable, this relationship can only be learned implicitly, leading to the "misfocus" issue. To address this, we propose a novel training pipeline called "Spatial distillation And Reliable Causal frame localization", which leverages an off-the-shelf image QA model to help the video QA model better grasp relevant information in the temporal and spatial dimensions of the video. Specifically, we use the visual-question and answer priors of an image QA model to obtain pseudo ground-truth causal frames and explicitly guide the video QA model in the temporal dimension. Moreover, owing to the superior spatial reasoning ability of image models, we transfer this knowledge to video models via knowledge distillation. Our model-agnostic approach outperforms previous methods on various benchmarks and consistently improves performance (by up to 5%) across several video QA models, both pre-trained and non-pre-trained.
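The abstract describes two guidance signals: pseudo causal-frame labels derived from a frozen image QA model's per-frame answer confidence (temporal guidance), and feature distillation from the image model to the video model (spatial guidance). The following is a minimal PyTorch sketch of those two losses; all tensor shapes, loss choices, and weights here are illustrative assumptions, not the thesis's actual implementation.

# Hypothetical sketch of the two guidance signals described in the abstract.
# Shapes, loss forms, and weights are illustrative assumptions only.
import torch
import torch.nn.functional as F

def pseudo_causal_frame_labels(answer_confidence):
    # answer_confidence: (T,) a frozen image-QA model's confidence in the
    # ground-truth answer, evaluated independently on each of T frames.
    # Frames where the image model is confidently correct are treated as
    # (soft) pseudo causal frames.
    return F.softmax(answer_confidence, dim=0)

def temporal_guidance_loss(frame_attention, pseudo_labels):
    # Explicitly align the video QA model's temporal attention (a
    # distribution over frames) with the pseudo causal-frame distribution;
    # KL divergence is one natural choice.
    log_attn = torch.log(frame_attention.clamp_min(1e-8))
    return F.kl_div(log_attn, pseudo_labels, reduction="sum")

def spatial_distillation_loss(student_feats, teacher_feats):
    # Transfer the image model's spatial reasoning by matching per-frame
    # features; plain L2 feature matching is one simple option.
    return F.mse_loss(student_feats, teacher_feats.detach())

# Toy usage with random tensors standing in for real model outputs.
T, D = 16, 256                                # frames, feature dimension
answer_confidence = torch.randn(T)            # stand-in for image-QA outputs
pseudo = pseudo_causal_frame_labels(answer_confidence)
attention = F.softmax(torch.randn(T), dim=0)  # video model's frame attention
student = torch.randn(T, D)                   # video-model frame features
teacher = torch.randn(T, D)                   # image-model frame features

loss = (temporal_guidance_loss(attention, pseudo)
        + 0.5 * spatial_distillation_loss(student, teacher))
print(f"combined guidance loss: {loss.item():.4f}")

In the full method the training objective would also include the standard QA loss and, per Section 3.3 of the table of contents, a temporal-guided mixup term; both are omitted from this sketch.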
dc.description.provenance [en]: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-08-08T16:18:02Z. No. of bitstreams: 0
dc.description.provenance [en]: Made available in DSpace on 2023-08-08T16:18:02Z (GMT). No. of bitstreams: 0
dc.description.tableofcontents:
Verification Letter from the Oral Examination Committee (i)
Acknowledgements (iii)
摘要 (v)
Abstract (vii)
Contents (ix)
List of Figures (xi)
List of Tables (xiii)
Chapter 1 Introduction (1)
Chapter 2 Related Work (5)
2.1 Image Question Answering (5)
2.2 Video Question Answering (6)
Chapter 3 Method (7)
3.1 Extraction of Causal Frame and Spatial Knowledge (8)
3.2 Spatial-Temporal Guidance (9)
3.2.1 Temporal Guidance (9)
3.2.2 Spatial Guidance (10)
3.2.3 Training Objectives (11)
3.3 Temporal Guided Mixup (11)
Chapter 4 Experiments (13)
4.1 Preliminary: Capability of Image QA Model (13)
4.2 Settings (14)
4.2.1 Benchmarks (14)
4.2.2 Video Backbones (14)
4.2.3 Implementation Details (14)
4.2.3.1 Settings for Fine-tuning Image Model (15)
4.2.3.2 Settings for Training Video Models (15)
4.3 State-of-the-art Comparison (16)
4.4 Analysis of Effectiveness and Applicability (18)
4.5 Ablation Studies (19)
4.5.1 Primary Ablation Experiments (19)
4.5.2 Efficacy of Temporal Guided Mixup (20)
Chapter 5 Conclusion (21)
References (23)
Appendix A — Implementation Details (31)
A.1 Fine-tuning Image Model (31)
A.2 Frames Used for Knowledge Extraction (31)
Appendix B — Experimental Results (33)
B.1 Additional Qualitative Results (33)
dc.language.iso: en
dc.subject [zh_TW]: 多模態學習
dc.subject [zh_TW]: 機器學習
dc.subject [zh_TW]: 視覺與語言
dc.subject [zh_TW]: 知識蒸餾
dc.subject [zh_TW]: 視覺問答
dc.subject [zh_TW]: 影片問答
dc.subject [en]: vision and language
dc.subject [en]: knowledge distillation
dc.subject [en]: multimodal learning
dc.subject [en]: visual question answering
dc.subject [en]: video question answering
dc.subject [en]: machine learning
dc.title [zh_TW]: 透過問題相關幀定位和空間引導改善影片理解
dc.title [en]: Improving Video Understanding through Reliable Question-Relevant Frame Localization and Spatial Guidance
dc.type: Thesis
dc.date.schoolyear: 111-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee [zh_TW]: 葉梅珍;陳奕廷;陳駿丞
dc.contributor.oralexamcommittee [en]: Mei-Chen Yeh;Yi-Ting Chen;Jun-Cheng Chen
dc.subject.keyword [zh_TW]: 影片問答,視覺問答,知識蒸餾,視覺與語言,多模態學習,機器學習
dc.subject.keyword [en]: video question answering, visual question answering, knowledge distillation, vision and language, multimodal learning, machine learning
dc.relation.page: 34
dc.identifier.doi: 10.6342/NTU202300917
dc.rights.note: 同意授權(全球公開) (Authorized for worldwide open access)
dc.date.accepted: 2023-07-14
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 資訊工程學系 (Department of Computer Science and Information Engineering)
dc.date.embargo-lift: 2028-07-11
Appears in Collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in This Item:
File | Size | Format
ntu-111-2.pdf (embargoed until 2028-07-11) | 5.96 MB | Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
