NTU Theses and Dissertations Repository
Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88100
Full metadata record (DC field [language]: value):
dc.contributor.advisor [zh_TW]: 徐宏民
dc.contributor.advisor [en]: Winston Hsu
dc.contributor.author [zh_TW]: 蔡秉辰
dc.contributor.author [en]: Bing-Chen Tsai
dc.date.accessioned: 2023-08-08T16:18:02Z
dc.date.available: 2023-11-09
dc.date.copyright: 2023-08-08
dc.date.issued: 2023
dc.date.submitted: 2023-07-13
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88100
dc.description.abstract [zh_TW]: 影片問答是一項具有挑戰性的任務,需要模型在豐富的影片內容中準確地識別和理解相關信息。傳統方法試圖通過考慮視覺問題的關聯性,強調特定幀中的相關信息。然而,由於缺乏因果幀的真實標籤,這種關聯性只能透過「間接地」學習,導致了「聚焦錯誤」(misfocus)的問題。為了解決這個問題,我們提出了一種新的訓練框架,稱為「空間蒸餾與因果幀定位」(Spatial distillation And Reliable Causal frame localization),利用現成的圖像問答模型來幫助影片問答模型更好地理解影片的時間和空間維度中的相關信息。具體而言,我們利用圖像問答模型的視覺問題和答案先驗知識來獲得因果幀的偽標籤,並在時間維度上「直接地」指導影片聚焦在相關聯的幀。此外,由於圖像模型具有出色的空間推理能力,我們通過知識蒸餾將這種知識轉移到影片模型中。我們的方法不依賴於特定模型,在各種基準測試中均優於以前的方法。此外,它在多個影片問答模型(包括預訓練和非預訓練模型)上都能穩定提升性能。
dc.description.abstract [en]: Video Question Answering (Video QA) is a challenging task that requires models to accurately identify and contextualize relevant information within rich video content. Conventional approaches attempt to emphasize related information in specific frames by considering the visual-question relationship. However, because ground-truth causal frames are unavailable, this relationship can only be learned implicitly, leading to the "misfocus" issue. To address this, we propose a novel training pipeline called "Spatial distillation And Reliable Causal frame localization", which leverages an off-the-shelf image QA model to help the video QA model better grasp relevant information in the temporal and spatial dimensions of the video. Specifically, we use the visual-question and answer priors of an image QA model to obtain pseudo ground-truth causal frames and explicitly guide the video QA model in the temporal dimension. Moreover, owing to the superior spatial reasoning ability of image models, we transfer this knowledge to video models via knowledge distillation. Our model-agnostic approach outperforms previous methods on various benchmarks and consistently improves performance (by up to 5%) across several video QA models, both pre-trained and non-pre-trained.
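The abstract describes two guidance signals: pseudo causal-frame labels derived from a frozen image QA model's per-frame answer confidence (temporal guidance), and feature distillation from the image model to the video model (spatial guidance). The following is a minimal PyTorch sketch of those two losses; all tensor shapes, loss choices, and weights here are illustrative assumptions, not the thesis's actual implementation.

# Hypothetical sketch of the two guidance signals described in the abstract.
# Shapes, loss forms, and weights are illustrative assumptions only.
import torch
import torch.nn.functional as F

def pseudo_causal_frame_labels(answer_confidence):
    # answer_confidence: (T,) a frozen image-QA model's confidence in the
    # ground-truth answer, evaluated independently on each of T frames.
    # Frames where the image model is confidently correct are treated as
    # (soft) pseudo causal frames.
    return F.softmax(answer_confidence, dim=0)

def temporal_guidance_loss(frame_attention, pseudo_labels):
    # Explicitly align the video QA model's temporal attention (a
    # distribution over frames) with the pseudo causal-frame distribution;
    # KL divergence is one natural choice.
    log_attn = torch.log(frame_attention.clamp_min(1e-8))
    return F.kl_div(log_attn, pseudo_labels, reduction="sum")

def spatial_distillation_loss(student_feats, teacher_feats):
    # Transfer the image model's spatial reasoning by matching per-frame
    # features; plain L2 feature matching is one simple option.
    return F.mse_loss(student_feats, teacher_feats.detach())

# Toy usage with random tensors standing in for real model outputs.
T, D = 16, 256                                # frames, feature dimension
answer_confidence = torch.randn(T)            # stand-in for image-QA outputs
pseudo = pseudo_causal_frame_labels(answer_confidence)
attention = F.softmax(torch.randn(T), dim=0)  # video model's frame attention
student = torch.randn(T, D)                   # video-model frame features
teacher = torch.randn(T, D)                   # image-model frame features

loss = (temporal_guidance_loss(attention, pseudo)
        + 0.5 * spatial_distillation_loss(student, teacher))
print(f"combined guidance loss: {loss.item():.4f}")

In the full method the training objective would also include the standard QA loss and, per Section 3.3 of the table of contents, a temporal-guided mixup term; both are omitted from this sketch.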
dc.description.provenance [en]: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-08-08T16:18:02Z. No. of bitstreams: 0
dc.description.provenance [en]: Made available in DSpace on 2023-08-08T16:18:02Z (GMT). No. of bitstreams: 0
dc.description.tableofcontents:
Verification Letter from the Oral Examination Committee (i)
Acknowledgements (iii)
摘要 (v)
Abstract (vii)
Contents (ix)
List of Figures (xi)
List of Tables (xiii)
Chapter 1 Introduction (1)
Chapter 2 Related Work (5)
2.1 Image Question Answering (5)
2.2 Video Question Answering (6)
Chapter 3 Method (7)
3.1 Extraction of Causal Frame and Spatial Knowledge (8)
3.2 Spatial-Temporal Guidance (9)
3.2.1 Temporal Guidance (9)
3.2.2 Spatial Guidance (10)
3.2.3 Training Objectives (11)
3.3 Temporal Guided Mixup (11)
Chapter 4 Experiments (13)
4.1 Preliminary: Capability of Image QA Model (13)
4.2 Settings (14)
4.2.1 Benchmarks (14)
4.2.2 Video Backbones (14)
4.2.3 Implementation Details (14)
4.2.3.1 Settings for Fine-tuning Image Model (15)
4.2.3.2 Settings for Training Video Models (15)
4.3 State-of-the-art Comparison (16)
4.4 Analysis of Effectiveness and Applicability (18)
4.5 Ablation Studies (19)
4.5.1 Primary Ablation Experiments (19)
4.5.2 Efficacy of Temporal Guided Mixup (20)
Chapter 5 Conclusion (21)
References (23)
Appendix A — Implementation Details (31)
A.1 Fine-tuning Image Model (31)
A.2 Frames Used for Knowledge Extraction (31)
Appendix B — Experimental Results (33)
B.1 Additional Qualitative Results (33)
dc.language.iso: en
dc.subject [zh_TW]: 多模態學習
dc.subject [zh_TW]: 機器學習
dc.subject [zh_TW]: 視覺與語言
dc.subject [zh_TW]: 知識蒸餾
dc.subject [zh_TW]: 視覺問答
dc.subject [zh_TW]: 影片問答
dc.subject [en]: vision and language
dc.subject [en]: knowledge distillation
dc.subject [en]: multimodal learning
dc.subject [en]: visual question answering
dc.subject [en]: video question answering
dc.subject [en]: machine learning
dc.title [zh_TW]: 透過問題相關幀定位和空間引導改善影片理解
dc.title [en]: Improving Video Understanding through Reliable Question-Relevant Frame Localization and Spatial Guidance
dc.type: Thesis
dc.date.schoolyear: 111-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee [zh_TW]: 葉梅珍;陳奕廷;陳駿丞
dc.contributor.oralexamcommittee [en]: Mei-Chen Yeh;Yi-Ting Chen;Jun-Cheng Chen
dc.subject.keyword [zh_TW]: 影片問答,視覺問答,知識蒸餾,視覺與語言,多模態學習,機器學習
dc.subject.keyword [en]: video question answering, visual question answering, knowledge distillation, vision and language, multimodal learning, machine learning
dc.relation.page: 34
dc.identifier.doi: 10.6342/NTU202300917
dc.rights.note: 同意授權(全球公開) (Authorized for worldwide open access)
dc.date.accepted: 2023-07-14
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 資訊工程學系 (Department of Computer Science and Information Engineering)
dc.date.embargo-lift: 2028-07-11
Appears in Collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in This Item:
File | Size | Format
ntu-111-2.pdf (embargoed until 2028-07-11) | 5.96 MB | Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
