Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88611

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 徐宏民 | zh_TW |
| dc.contributor.advisor | Winston H. Hsu | en |
| dc.contributor.author | 蘇弘庭 | zh_TW |
| dc.contributor.author | Hung-Ting Su | en |
| dc.date.accessioned | 2023-08-15T17:03:23Z | - |
| dc.date.available | 2023-11-09 | - |
| dc.date.copyright | 2023-08-15 | - |
| dc.date.issued | 2023 | - |
| dc.date.submitted | 2023-07-30 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88611 | - |
| dc.description.abstract | 機器學習模型的多模式影片理解對應了人類的視覺和文字的感知,對於各種應用至關重要。然而,以往基於監督學習的研究面臨兩個根本性的挑戰:(1)需要大量的人工標注數據和(2)缺乏進行系統二推理的能力。本研究提出三種創新的解決方法來應對以上挑戰。
首先,為了降低數據標注成本,本研究提出「影片問答生成」這項新任務,其目標為自動生成用於訓練影片問答系統的問題與答案。與以往依賴文字描述的問答生成方法不同,本研究直接以影片作為輸入,避免了訊息遺失。為了解決影片問答生成問題,我們設計了生成器預測器網絡,通過試圖回答問題來鼓勵模型輸出可回答的問題。
其次,我們提出了一種解決因果影片問答的方案:從語言模型中提取因果常識。不同於僅依賴視覺觀察的傳統影片問答,因果影片問答整合了常識,以進行更複雜的推理。現有基於文字描述的問答生成方法僅能提取關聯知識,因此不適用於因果影片問答。為了解決這個挑戰,我們利用在訓練過程中學習到廣泛因果關係的語言模型來提取常識。我們將意圖與動作的配對輸入語言模型取得回應後,再將其轉化為問題與答案的配對,並用於訓練影片問答系統。
第三,為了檢驗和發展機器的系統二推理能力,我們提出了「橋段理解」這項新任務。橋段是創作中常見的敘事工具,理解橋段需要對因果與動機進行推理的能力。我們收集了兩個橋段理解資料集,並於實驗中發現先進模型與人類表現之間的顯著差距。為了解決這個問題,我們提出了兩種方法:(1)多層次理解模型,同時理解時間(故事情節)和空間(角色關係)兩個維度;(2)角色設定理解和說故事模型,通過學習在潛在空間中對齊視覺與文字描述特徵,以學習人類在文字描述中所做的詮釋。
實驗結果證明了我們提出方法的有效性,在影片問答生成、橋段理解和零樣本因果影片問答上超越了前沿的方法。此外,我們提供了詳細的分析,以開闢新的研究路徑。 | zh_TW |
| dc.description.abstract | Understanding multi-modal video inputs, reflecting human visual and textual cognition, is crucial for various applications. However, previous supervised learning studies have faced two fundamental challenges: (1) the need for significant human annotation and (2) the lack of ability to perform System 2-level reasoning. In this research, we propose three novel components to address these challenges.
First, to lower the cost of data annotation, we present a task called Video Question-Answer Generation (VQAG), which automatically generates question-answer pairs for training Video QA systems. Unlike previous QA-generation methods that rely on captions, we directly input video and avoid information loss. To address VQAG, we design a Generator-Pretester Network that encourages the model to output answerable questions by attempting to answer them.
Second, we propose a solution for tackling causal Video QA by extracting causal commonsense knowledge from language models. Unlike traditional Video QA, which relies solely on visual observations, causal Video QA integrates commonsense knowledge for more sophisticated reasoning. Existing caption-based QA methods are limited to extracting association knowledge, making them unsuitable for causal Video QA. To address this challenge, we utilize language models that have observed vast causal relations during training to extract commonsense knowledge. We prompt the models with intention-action pairs and extract responses, which are then transformed into question-answer pairs. These pairs can be used to train Video QA systems.
Third, we propose a novel task, Trope Understanding, to examine and develop machines’ System 2 reasoning capabilities. Understanding storytelling devices called tropes requires causal and motivational reasoning skills. We collect two movie trope understanding datasets and highlight the significant gaps between state-of-the-art models and human performance. To address this, we propose two methods: (1) a Multi-level Comprehension Model, which comprehends both temporal (storyline) and spatial (character relations) dimensions, and (2) a Trope Understanding and Storytelling Model, which leverages human interpretation by learning to align visual and textual features in a latent space.
Experimental results demonstrate the effectiveness of our proposed components, outperforming state-of-the-art methods on Video Question Generation, Trope Understanding, and Zero-Shot Causal Video Question Answering. Moreover, we provide detailed analysis to pave the way for future work. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-08-15T17:03:23Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2023-08-15T17:03:23Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Chapter 1 Introduction 1
Chapter 2 Video Question-Answer Generation (VQAG) 5
2.1 Introduction 5
2.2 Related Work 10
2.3 Generator-Pretester Network (GPN) 12
2.3.1 Joint Question-Answer Generator (JQAG) 13
2.3.1.1 Video Encoder 13
2.3.1.2 Answer Selector 15
2.3.1.3 Question Generator 15
2.3.2 Pretester (PT) 16
2.4 Question Generation Experiments 18
2.4.1 Experimental Setup 18
2.4.1.1 Data 18
2.4.1.2 Evaluation Metrics 19
2.4.1.3 Implementation Detail 19
2.4.2 Question Generation Performance 20
2.4.2.1 ActivityNet-QA Results 20
2.4.2.2 TVQA Results 21
2.4.3 Ablation Study 21
2.4.4 Case Study 24
2.4.5 Error Analysis 25
2.5 Application in Question Answering 27
2.5.1 Setup 27
2.5.2 Question Answering Performance 28
2.5.3 Pretraining Data Analysis 29
2.6 Conclusion 30
Chapter 3 Causal Knowledge Extraction from Language Models (CaKE-LM) 33
3.1 Introduction 33
3.2 Related Work 36
3.2.1 Causal Video Question Answering 36
3.2.2 QA Generation for Video QA 37
3.2.3 Language Models Adaptation 38
3.3 Approach 38
3.3.1 Causal Knowledge from Language Models 39
3.3.2 Knowledge Extraction by Prompting 40
3.3.3 Question-Answer Generation 41
3.3.4 Distillation with LM Answer 42
3.4 Experiments 43
3.4.1 Setup 43
3.4.2 Video QA Performance 46
3.4.3 Language Model Analysis 48
3.4.4 Error Analysis 50
3.4.5 Potential Extensions 51
3.5 Conclusion 52
Chapter 4 Trope in Movie Synopses (TiMoS) 53
4.2 Related Works 57
4.2.1 Contextual Embedding 57
4.2.2 Machine Comprehension Dataset 58
4.2.3 Films Related Works 59
4.3 New Dataset: TiMoS 60
4.3.1 Overview 60
4.3.2 Data Collection 61
4.3.3 Data Analysis 62
4.3.4 Challenges 64
4.4 Method 66
4.4.1 Task Formulation 66
4.4.2 Multi-Level Comprehension Network 66
4.4.3 Multi-Step Recurrent Relational Network 69
4.5 Experiments 72
4.5.1 Baselines 72
4.5.2 Streams 73
4.6 Results and Analysis 74
4.6.1 Baseline Comparison 74
4.6.2 Trope Difficulty 75
4.6.3 Knowledge Adaption 76
4.6.4 Qualitative Analysis 78
4.7 Human Evaluation and Discussion 79
4.7.1 Human Inference Process 79
4.7.2 Setup 79
4.7.3 Human Performance 80
4.7.4 Potential Directions 81
4.8 Conclusion 81
Chapter 5 Trope Understanding in Movies and Animations (TrUMAn) 83
5.1 Introduction 83
5.2 Impact and Potential Extensions 86
5.3 Related Work 88
5.4 TrUMAn Dataset 90
5.4.1 Overview 90
5.4.2 Data Collection 92
5.4.3 Data Analysis 92
5.4.4 Human Evaluation on TrUMAn 93
5.4.5 Data Availability 95
5.5 Trope Understanding and Storytelling (TrUSt) Model 96
5.5.1 Video Encoder 97
5.5.2 Conceptual Storyteller 98
5.5.3 Trope Understanding 99
5.6 Experiments 101
5.6.1 Modality 101
5.6.2 Compared Methods 102
5.6.3 Results and Discussion 103
5.7 Conclusion 107
Chapter 6 Conclusion 109
References 111 | - |
| dc.language.iso | en | - |
| dc.subject | 影片問題生成 | zh_TW |
| dc.subject | 影片問答 | zh_TW |
| dc.subject | 橋段理解 | zh_TW |
| dc.subject | 零樣本學習 | zh_TW |
| dc.subject | Video Question Answering | en |
| dc.subject | Zero-Shot Learning | en |
| dc.subject | Video Question Generation | en |
| dc.subject | Trope Understanding | en |
| dc.title | 多模態影片理解及其擴展 | zh_TW |
| dc.title | Multi-modal Video Comprehension and Beyond | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 111-2 | - |
| dc.description.degree | Doctoral | - |
| dc.contributor.oralexamcommittee | 鄭卜壬;賴尚宏;林嘉文;彭文志;王蒞君;陳信希;李政德;李宏毅 | zh_TW |
| dc.contributor.oralexamcommittee | Pu-Jen Cheng;Shang-Hong Lai;Chia-Wen Lin;Wen-Chih Peng;Li-Chun Wang;Hsin-Hsi Chen;Cheng-Te Li;Hung-Yi Lee | en |
| dc.subject.keyword | 影片問答,影片問題生成,零樣本學習,橋段理解 | zh_TW |
| dc.subject.keyword | Video Question Answering, Video Question Generation, Zero-Shot Learning, Trope Understanding | en |
| dc.relation.page | 131 | - |
| dc.identifier.doi | 10.6342/NTU202302209 | - |
| dc.rights.note | Authorized (campus access only) | - |
| dc.date.accepted | 2023-08-01 | - |
| dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
| dc.contributor.author-dept | Graduate Institute of Networking and Multimedia | - |
| Appears in Collections: | Graduate Institute of Networking and Multimedia |
Files in This Item:

| File | Size | Format |
|---|---|---|
| ntu-111-2.pdf (access restricted to NTU campus IPs; please use the VPN service for off-campus access) | 21.15 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
