NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88611
Full metadata record (DC field: value [language])
dc.contributor.advisor: 徐宏民 [zh_TW]
dc.contributor.advisor: Winston H. Hsu [en]
dc.contributor.author: 蘇弘庭 [zh_TW]
dc.contributor.author: Hung-Ting Su [en]
dc.date.accessioned: 2023-08-15T17:03:23Z
dc.date.available: 2023-11-09
dc.date.copyright: 2023-08-15
dc.date.issued: 2023
dc.date.submitted: 2023-07-30
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88611
dc.description.abstract [zh_TW]:
機器學習模型的多模式影片理解對應了人類的視覺和文字的感知,對於各種應用至關重要。然而,以往基於監督學習的研究面臨兩個根本性的挑戰:(1)需要大量的人工標注數據和(2)缺乏進行系統二推理的能力。本研究提出三種創新的解決方法來應對以上挑戰。
首先,為了降低數據標注成本,本研究提出影片問答生成這項新任務,該任務的目標為自動生成用於訓練影片問答系統的問題與答案。與以往依賴於文字描述的問答生成方法不同,本研究直接輸入影片並避免了訊息遺失。為了解決影片問答生成問題,我們設計了一個生成器-預測器網絡,通過試圖回答問題來鼓勵模型輸出可回答的問題。
其次,我們提出了一種解決因果影片問答問題的方案:從語言模型中提取因果常識。不同於傳統僅依賴於視覺觀察的影片問答,因果影片問答整合了常識,以進行更複雜的推理。現有基於文字描述的問答方法僅限於提取關聯知識,因此不適用於因果影片問答。為了解決這個挑戰,我們利用在訓練過程中學習到廣泛因果關係的語言模型來提取常識。我們將意圖-動作的配對輸入語言模型,得到回應後將其轉化為問題-答案的配對,並將之用於訓練影片問答系統。
第三,為了檢驗和發展機器的系統二推理能力,我們提出了橋段理解這項新任務。做為創作中的敘事工具,理解橋段需要對於因果和動機的推理能力。我們收集了兩個橋段理解資料集,並於實驗中發現先進模型與人類表現之間的顯著差距。為了解決這個問題,我們提出了兩種方法:(1)多層次理解模型,以理解時間(故事情節)和空間(角色關係)兩個維度;(2)橋段理解和說故事模型,通過學習在潛在空間中對視覺和文字描述特徵進行對齊,以學習人透過文字描述所進行的詮釋。
實驗結果證明了我們提出方法的有效性,在影片問答生成、橋段理解和零樣本因果影片問答上超越了前沿的方法。此外,我們提供了詳細的分析以開闢新的研究路徑。
dc.description.abstract [en]:
Understanding multi-modal video inputs, reflecting human visual and textual cognition, is crucial for various applications. However, previous supervised learning studies have faced two fundamental challenges: (1) the need for significant human annotation and (2) the lack of ability to perform System 2-level reasoning. In this research, we propose three novel components to address these challenges.
First, to lower the cost of data annotation, we present a task called Video Question-Answer Generation (VQAG), which automatically generates question-answer pairs for training Video QA systems. Unlike previous QA-generation methods that rely on captions, we directly input video and avoid information loss. To address VQAG, we design a Generator-Pretester Network that encourages the model to output answerable questions by attempting to answer them.
Second, we propose a solution for tackling causal Video QA by extracting causal commonsense knowledge from language models. Unlike traditional Video QA, which relies solely on visual observations, causal Video QA integrates commonsense knowledge for more sophisticated reasoning. Existing caption-based QA methods are limited to extracting association knowledge, making them unsuitable for causal Video QA. To address this challenge, we utilize language models that have observed vast causal relations during training to extract commonsense knowledge. We prompt the models with intention-action pairs and extract responses, which are then transformed into question-answer pairs. These pairs can be used to train Video QA systems.
Third, we propose a novel task, Trope Understanding, to examine and develop machines' System 2 reasoning capabilities. Understanding storytelling devices called tropes requires causal and motivational reasoning skills. We collect two movie trope understanding datasets and highlight the significant gaps between state-of-the-art models and human performance. To address this, we propose two methods: (1) a Multi-level Comprehension Model, which comprehends both temporal (storyline) and spatial (character relations) dimensions, and (2) a Trope Understanding and Storytelling Model, which leverages human interpretation by learning to align visual and textual features in a latent space.
Experimental results demonstrate the effectiveness of our proposed components, outperforming state-of-the-art methods on Video Question Generation, Trope Understanding, and Zero-Shot Causal Video Question Answering. Moreover, we provide detailed analysis to pave the way for future work.
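The causal-knowledge step summarized in the abstract above (prompting a language model with intention-action pairs and recasting the responses as question-answer pairs) can be pictured with a short sketch. The following Python snippet is a hypothetical illustration only, not the thesis implementation: it assumes the Hugging Face transformers text-generation pipeline with a gpt2 placeholder model, and the prompt wording, templates, and make_causal_qa helper are invented for this example.

# Illustrative only: a minimal sketch (assumptions noted above) of prompting a
# pretrained language model with an intention-action pair and templating the
# completion into a question-answer pair for Video QA training.
from transformers import pipeline

# Any causal language model could stand in here; "gpt2" is an assumed placeholder.
generator = pipeline("text-generation", model="gpt2")

def make_causal_qa(action: str, intention: str):
    # Elicit commonsense causal knowledge about why the action is performed.
    prompt = (f"A person {action} because they want to {intention}. "
              f"The reason the person {action} is")
    output = generator(prompt, max_new_tokens=20, num_return_sequences=1)
    completion = output[0]["generated_text"][len(prompt):]
    reason = completion.strip().split(".")[0]
    # Recast the extracted knowledge as a question-answer pair.
    question = f"Why does the person {action}?"
    answer = reason if reason else f"to {intention}"
    return question, answer

if __name__ == "__main__":
    q, a = make_causal_qa("opens the fridge", "get something to eat")
    print(q)
    print(a)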
dc.description.provenance [en]: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-08-15T17:03:23Z. No. of bitstreams: 0
dc.description.provenance [en]: Made available in DSpace on 2023-08-15T17:03:23Z (GMT). No. of bitstreams: 0
dc.description.tableofcontents:
Chapter 1 Introduction 1
Chapter 2 Video Question-Answer Generation (VQAG) 5
2.1 Introduction 5
2.2 Related Work 10
2.3 Generator-Pretester Network (GPN) 12
2.3.1 Joint Question-Answer Generator (JQAG) 13
2.3.1.1 Video Encoder 13
2.3.1.2 Answer Selector 15
2.3.1.3 Question Generator 15
2.3.2 Pretester (PT) 16
2.4 Question Generation Experiments 18
2.4.1 Experimental Setup 18
2.4.1.1 Data 18
2.4.1.2 Evaluation Metrics 19
2.4.1.3 Implementation Detail 19
2.4.2 Question Generation Performance 20
2.4.2.1 ActivityNet-QA Results 20
2.4.2.2 TVQA Results 21
2.4.3 Ablation Study 21
2.4.4 Case Study 24
2.4.5 Error Analysis 25
2.5 Application in Question Answering 27
2.5.1 Setup 27
2.5.2 Question Answering Performance 28
2.5.3 Pretraining Data Analysis 29
2.6 Conclusion 30
Chapter 3 Causal Knowledge Extraction from Language Models (CaKE-LM) 33
3.1 Introduction 33
3.2 Related Work 36
3.2.1 Causal Video Question Answering 36
3.2.2 QA Generation for Video QA 37
3.2.3 Language Models Adaptation 38
3.3 Approach 38
3.3.1 Causal Knowledge from Language Models 39
3.3.2 Knowledge Extraction by Prompting 40
3.3.3 Question-Answer Generation 41
3.3.4 Distillation with LM Answer 42
3.4 Experiments 43
3.4.1 Setup 43
3.4.2 Video QA Performance 46
3.4.3 Language Model Analysis 48
3.4.4 Error Analysis 50
3.4.5 Potential Extensions 51
3.5 Conclusion 52
Chapter 4 Trope in Movie Synopses (TiMoS) 53
4.2 Related Works 57
4.2.1 Contextual Embedding 57
4.2.2 Machine Comprehension Dataset 58
4.2.3 Films Related Works 59
4.3 New Dataset: TiMoS 60
4.3.1 Overview 60
4.3.2 Data Collection 61
4.3.3 Data Analysis 62
4.3.4 Challenges 64
4.4 Method 66
4.4.1 Task Formulation 66
4.4.2 Multi-Level Comprehension Network 66
4.4.3 Multi-Step Recurrent Relational Network 69
4.5 Experiments 72
4.5.1 Baselines 72
4.5.2 Streams 73
4.6 Results and Analysis 74
4.6.1 Baseline Comparison 74
4.6.2 Trope Difficulty 75
4.6.3 Knowledge Adaption 76
4.6.4 Qualitative Analysis 78
4.7 Human Evaluation and Discussion 79
4.7.1 Human Inference Process 79
4.7.2 Setup 79
4.7.3 Human Performance 80
4.7.4 Potential Directions 81
4.8 Conclusion 81
Chapter 5 Trope Understanding in Movies and Animations (TrUMAn) 83
5.1 Introduction 83
5.2 Impact and Potential Extensions 86
5.3 Related Work 88
5.4 TrUMAn Dataset 90
5.4.1 Overview 90
5.4.2 Data Collection 92
5.4.3 Data Analysis 92
5.4.4 Human Evaluation on TrUMAn 93
5.4.5 Data Availability 95
5.5 Trope Understanding and Storytelling (TrUSt) Model 96
5.5.1 Video Encoder 97
5.5.2 Conceptual Storyteller 98
5.5.3 Trope Understanding 99
5.6 Experiments 101
5.6.1 Modality 101
5.6.2 Compared Methods 102
5.6.3 Results and Discussion 103
5.7 Conclusion 107
Chapter 6 Conclusion 109
References 111
dc.language.iso: en
dc.subject: 影片問題生成 [zh_TW]
dc.subject: 影片問答 [zh_TW]
dc.subject: 橋段理解 [zh_TW]
dc.subject: 零樣本學習 [zh_TW]
dc.subject: Video Question Answering [en]
dc.subject: Zero-Shot Learning [en]
dc.subject: Video Question Generation [en]
dc.subject: Trope Understanding [en]
dc.title: 多模態影片理解及其擴展 [zh_TW]
dc.title: Multi-modal Video Comprehension and Beyond [en]
dc.type: Thesis
dc.date.schoolyear: 111-2
dc.description.degree: 博士 (Doctoral)
dc.contributor.oralexamcommittee: 鄭卜壬;賴尚宏;林嘉文;彭文志;王蒞君;陳信希;李政德;李宏毅 [zh_TW]
dc.contributor.oralexamcommittee: Pu-Jen Cheng;Shang-Hong Lai;Chia-Wen Lin;Wen-Chih Peng;Li-Chun Wang;Hsin-Hsi Chen;Cheng-Te Li;Hung-Yi Lee [en]
dc.subject.keyword: 影片問答, 影片問題生成, 零樣本學習, 橋段理解 [zh_TW]
dc.subject.keyword: Video Question Answering, Video Question Generation, Zero-Shot Learning, Trope Understanding [en]
dc.relation.page: 131
dc.identifier.doi: 10.6342/NTU202302209
dc.rights.note: 同意授權(限校園內公開) (Authorization granted; access limited to campus)
dc.date.accepted: 2023-08-01
dc.contributor.author-college: 電機資訊學院
dc.contributor.author-dept: 資訊網路與多媒體研究所
Appears in Collections: 資訊網路與多媒體研究所

Files in This Item:
File: ntu-111-2.pdf (21.15 MB, Adobe PDF)
Access restricted to NTU campus IP addresses (off-campus users should connect via the VPN service).


All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
