Please use this identifier to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/86518

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 徐宏民 | zh_TW |
| dc.contributor.advisor | Winston H. Hsu | en |
| dc.contributor.author | 李信穎 | zh_TW |
| dc.contributor.author | Hsin-Ying Lee | en |
| dc.date.accessioned | 2023-03-20T00:00:34Z | - |
| dc.date.available | 2023-11-10 | - |
| dc.date.copyright | 2022-08-18 | - |
| dc.date.issued | 2022 | - |
| dc.date.submitted | 2002-01-01 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/86518 | - |
| dc.description.abstract | 雖然最近大規模的影片語言預訓練在影片問答方面有了很大的進展，但空間建模的設計沒有圖像語言模型的那麼精緻；現有的時間建模方式也受到模態之間沒有對齊的影響。為了學習精緻的視覺理解，我們將時空建模解耦，並提出了一種結合圖像和影片語言編碼器的混合結構。前者獨立於時間從較大但稀疏採樣的影格中理解空間語義，而後者在較低空間但較高時間解析度下捕捉時間動態。另外，為了幫助影片語言模型學習影片問答的時間關係，我們提出了一種新穎的預訓練目標，即時間引用建模，它要求模型辨別影片序列中事件的時間位置。透過廣泛且詳細的實驗，我們證明這個方法做得比以前在數量級更大的資料集上預訓練的研究更好。 | zh_TW |
| dc.description.abstract | While recent large-scale video-language pre-training has made great progress in video question answering, the design of spatial modeling is less fine-grained than that of image-language models, and existing practices of temporal modeling also suffer from weak and noisy alignment between modalities. To learn fine-grained visual understanding, we decouple spatial-temporal modeling and propose a hybrid pipeline that integrates an image-language and a video-language encoder. The former encodes spatial semantics from larger but sparsely sampled frames independently of time, while the latter models temporal dynamics at lower spatial but higher temporal resolution. To help the video-language model learn temporal relations for video QA, we propose a novel pre-training objective, Temporal Referring Modeling, which requires the model to identify the temporal positions of events in video sequences. Extensive and detailed experiments demonstrate that our model outperforms previous work pre-trained on datasets orders of magnitude larger. (A minimal illustrative sketch of this decoupled pipeline follows the metadata table below.) | en |
| dc.description.provenance | Made available in DSpace on 2023-03-20T00:00:34Z (GMT). No. of bitstreams: 1 U0001-0308202220082500.pdf: 7539288 bytes, checksum: 72d279cb46e85b2879ba4ddef6c23b50 (MD5) Previous issue date: 2022 | en |
| dc.description.tableofcontents | Verification Letter from the Oral Examination Committee ... ii
Acknowledgements ... iii
摘要 ... v
Abstract ... vii
Contents ... ix
List of Figures ... xiii
List of Tables ... xv
Chapter 1 Introduction ... 1
Chapter 2 Related Work ... 5
2.1 Video Question Answering ... 5
2.2 Pretraining for Temporal Relation Modeling ... 5
2.2.1 Learning from Global Alignment ... 6
2.2.2 Learning from Local Alignment and Frame Ordering ... 6
2.2.3 Learning from Large-Scale Video Question Answering Datasets ... 6
2.3 Encoding Motion and Appearance ... 7
Chapter 3 Method ... 9
3.1 Decoupled Spatial-Temporal Encoders ... 9
3.1.1 Image-Language Encoding ... 11
3.1.2 Video Language Encoding ... 11
3.1.3 Answer Selection ... 12
3.2 Temporal Referring Modeling ... 12
Chapter 4 Experiments ... 15
4.1 Preliminary Analysis ... 15
4.1.1 Encoding Spatial Semantics ... 15
4.1.2 Modeling Temporal Relationships ... 16
4.2 Video Question Answering ... 18
4.3 Ablation Studies ... 19
Chapter 5 Conclusion ... 21
References ... 23
Appendix A — Implementation Details ... 37
A.1 Model Architectures ... 37
A.2 Video Language Pre-training ... 38
A.2.1 Details of Question and Video Synthesis for Temporal Referring Modeling ... 38
A.2.2 Auxiliary Objective with Contrastive Learning ... 39
A.2.3 Pre-training Datasets ... 40
A.3 Optimization ... 41
Appendix B — Experiment Details ... 43
B.1 Details of Temporal Modeling Analysis ... 43
B.2 Pre-training Data Used by Prior Approaches ... 44
B.3 Full Results and Analysis on AGQA 2.0 ... 44
B.3.1 Full Results of Temporal Modeling Analysis ... 45
B.3.2 Full Results and Analysis of Our Method ... 45
B.3.3 Full Results of Ablation Study of Encoding Streams ... 48 | - |
| dc.language.iso | zh_TW | - |
| dc.subject | 深度學習 | zh_TW |
| dc.subject | 時空間推理 | zh_TW |
| dc.subject | 影片問答 | zh_TW |
| dc.subject | 機器學習 | zh_TW |
| dc.subject | 影片理解 | zh_TW |
| dc.subject | Deep Learning | en |
| dc.subject | Video Understanding | en |
| dc.subject | Spatial-Temporal Modeling | en |
| dc.subject | Machine Learning | en |
| dc.subject | Video Question Answering | en |
| dc.title | 透過解耦學習影片問答中的時空間關係 | zh_TW |
| dc.title | Learning by Decoupling Spatial and Temporal Relations for Video Question Answering | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 111-1 | - |
| dc.description.degree | 碩士 (Master's) | - |
| dc.contributor.oralexamcommittee | 陳文進;葉梅珍;陳奕廷 | zh_TW |
| dc.contributor.oralexamcommittee | Wen-Chin Chen;Mei-Chen Yeh;Yi-Ting Chen | en |
| dc.subject.keyword | 機器學習,深度學習,影片理解,時空間推理,影片問答, | zh_TW |
| dc.subject.keyword | Machine Learning,Deep Learning,Video Understanding,Spatial-Temporal Modeling,Video Question Answering, | en |
| dc.relation.page | 49 | - |
| dc.identifier.doi | 10.6342/NTU202202028 | - |
| dc.rights.note | 同意授權(全球公開) [authorized for worldwide open access] | - |
| dc.date.accepted | 2022-08-16 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 資訊工程學系 | - |
| dc.date.embargo-lift | 2022-08-18 | - |
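The English abstract above outlines the architecture at a high level: a spatial (image-language) stream over a few high-resolution frames, a temporal (video-language) stream over a denser but lower-resolution clip, and a fusion step that scores candidate answers. The sketch below is a minimal PyTorch illustration of that decoupled two-stream idea only; every module, dimension, sampling rate, and the fusion scheme is an assumption made for the example and does not reproduce the thesis implementation or its pre-trained encoders.

```python
# Minimal sketch (not the thesis code) of a decoupled spatial-temporal video QA pipeline:
# a spatial stream over sparse high-resolution frames, a temporal stream over a dense
# low-resolution clip, and a fusion head that scores candidate answers.
import torch
import torch.nn as nn


class SpatialStream(nn.Module):
    """Encodes each sparsely sampled high-resolution frame independently (no temporal mixing)."""

    def __init__(self, dim=256, patch=32):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2
        )

    def forward(self, frames):                            # frames: (B, T_sparse, 3, H, W)
        b, t = frames.shape[:2]
        x = self.patchify(frames.flatten(0, 1))           # (B*T, dim, h, w)
        x = x.flatten(2).transpose(1, 2)                  # (B*T, tokens, dim)
        x = self.encoder(x).mean(dim=1)                   # one vector per frame
        return x.view(b, t, -1).mean(dim=1)               # average over sparse frames


class TemporalStream(nn.Module):
    """Encodes a densely sampled, low-resolution clip so tokens mix across time."""

    def __init__(self, dim=256, patch=16):
        super().__init__()
        self.patchify = nn.Conv3d(3, dim, kernel_size=(2, patch, patch), stride=(2, patch, patch))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2
        )

    def forward(self, clip):                              # clip: (B, 3, T_dense, h, w)
        x = self.patchify(clip)                           # (B, dim, t', h', w')
        x = x.flatten(2).transpose(1, 2)                  # (B, spatio-temporal tokens, dim)
        return self.encoder(x).mean(dim=1)


class DecoupledVideoQA(nn.Module):
    """Fuses both visual streams with text features to score each candidate answer."""

    def __init__(self, dim=256, vocab=30522):
        super().__init__()
        self.spatial = SpatialStream(dim)
        self.temporal = TemporalStream(dim)
        self.text = nn.EmbeddingBag(vocab, dim)           # stand-in for a real text encoder
        self.score = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, sparse_frames, dense_clip, qa_tokens):
        # qa_tokens: (B, num_answers, L), question concatenated with each candidate answer
        v = torch.cat([self.spatial(sparse_frames), self.temporal(dense_clip)], dim=-1)
        b, a, l = qa_tokens.shape
        t = self.text(qa_tokens.view(b * a, l)).view(b, a, -1)
        fused = torch.cat([v.unsqueeze(1).expand(-1, a, -1), t], dim=-1)
        return self.score(fused).squeeze(-1)              # (B, num_answers) logits


if __name__ == "__main__":
    model = DecoupledVideoQA()
    logits = model(
        torch.randn(2, 4, 3, 224, 224),                   # 4 sparse high-resolution frames
        torch.randn(2, 3, 16, 96, 96),                    # 16 dense low-resolution frames
        torch.randint(0, 30522, (2, 5, 20)),              # 5 candidate answers per question
    )
    print(logits.shape)                                   # torch.Size([2, 5])
```

The split mirrors the division of labor described in the abstract: the spatial stream never mixes information across frames, so it can afford high resolution on a few frames, while the temporal stream mixes tokens across many frames at low spatial resolution to capture temporal dynamics.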
Appears in Collections: 資訊工程學系
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-111-1.pdf | 7.36 MB | Adobe PDF |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
