Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97461

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 王鈺強 | zh_TW |
| dc.contributor.advisor | Yu-Chiang Frank Wang | en |
| dc.contributor.author | 陳志臻 | zh_TW |
| dc.contributor.author | Jr-Jen Chen | en |
| dc.date.accessioned | 2025-06-18T16:14:54Z | - |
| dc.date.available | 2025-06-19 | - |
| dc.date.copyright | 2025-06-18 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-06-04 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97461 | - |
| dc.description.abstract | 我們提出 ReXTime,這是一個旨在嚴格測試人工智慧模型在影片事件中進行時間推理能力的基準測試。具體而言,ReXTime 專注於「跨時間推理」,即類似人類的理解能力,當問題及其相應答案出現在不同的影片片段中。這種推理形式要求對影片片段之間的因果關係有高級理解,即使對最前沿的多模態大型語言模型也構成了重大挑戰。為了促進這一評估,我們開發了一個自動化流程來生成時間推理問答對,顯著減少了對勞動密集型人工標註的需求。我們的基準測試包括921個經過仔細審核的驗證樣本和2,143個測試樣本,每個樣本都經過人工篩選以確保準確性和相關性。評估結果表明,雖然前沿大型語言模型優於學術模型,但它們的表現仍然比人類差距顯著,準確率相差14.3%。此外,我們的流程創建了9,695個機器生成的訓練數據集樣本,無需人工努力,實證研究表明這些樣本可以通過微調增強跨時間推理能力。 | zh_TW |
| dc.description.abstract | We introduce ReXTime, a benchmark designed to rigorously test AI models' ability to perform temporal reasoning within video events. Specifically, ReXTime focuses on reasoning across time, i.e., human-like understanding when the question and its corresponding answer occur in different video segments. This form of reasoning, which requires an advanced understanding of cause-and-effect relationships across video segments, poses a significant challenge even to frontier multimodal large language models. To facilitate this evaluation, we develop an automated pipeline for generating temporal reasoning question-answer pairs, significantly reducing the need for labor-intensive manual annotation. Our benchmark includes 921 carefully vetted validation samples and 2,143 test samples, each manually curated for accuracy and relevance. Evaluation results show that while frontier large language models outperform academic models, they still lag behind human performance by a significant 14.3% accuracy gap. Additionally, our pipeline creates a training dataset of 9,695 machine-generated samples without manual effort, and empirical studies suggest these samples can enhance across-time reasoning via fine-tuning. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-06-18T16:14:54Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-06-18T16:14:54Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Verification Letter from the Oral Examination Committee; Acknowledgements; 摘要; Abstract; Contents; List of Figures; List of Tables; Chapter 1 Introduction; 1.1 Introduction; Chapter 2 Related work; 2.1 Temporal reasoning and event localization in videos; 2.2 Query-dependent moment retrieval; 2.3 Grounding large video-language models; Chapter 3 Data collection; 3.1 Selecting videos to annotate; 3.2 Question-answering on two events across time; 3.2.1 Finding candidate event pairs; 3.2.2 Event relation classification; 3.2.3 Question-answer generation; 3.3 Balancing cheap machine-generated data and high-quality human annotation; 3.3.1 Automatic data verification for cost reduction; 3.3.2 Mitigating the modality misalignment; Chapter 4 Benchmark; 4.1 Evaluation metrics; 4.2 How far are frontier MLLMs from solving REXTIME?; 4.3 Are academic and open-source models competitive?; 4.4 Dataset statistics; 4.4.1 Question-answer intersection over union; 4.4.2 Average certificate lengths; 4.4.3 Comparison to similar tasks; 4.4.4 Other statistics; 4.4.5 BlindQA; Chapter 5 Conclusion; References; Appendix A — Introduction; A.1 Changelog; A.1.1 Version 2; A.1.2 Version 3; A.2 Additional documentation and resources; A.2.1 Limitations; A.2.2 Social impact; A.2.3 Data source links; A.2.4 License; A.2.5 Author statement; A.2.6 Maintenance plan; A.2.7 Digital object identifier (DOI); A.2.8 Annotation instruction; A.3 Additional implementation details; A.3.1 Source datasets; A.3.2 Filter; A.3.3 Cost estimation; A.3.4 Computing resources; A.3.5 Training details and hyper-parameters; A.3.6 Counting temporal reasoning QAs; A.3.7 GUI; A.3.8 Prompts; A.3.8.1 ActivityNet event generation; A.3.8.2 QVHighlights event generation; A.3.8.3 Sequential QA generation; A.3.8.4 Cause-effect QA generation; A.3.8.5 Means-to-an-end QA generation; A.3.8.6 QA verification; A.3.8.7 Options generation; A.4 Additional experiment results; A.4.1 Qualitative results; A.4.2 Teaser examples; A.4.2.1 GPT-4V; A.4.2.2 Gemini-1.5-Pro; A.4.2.3 Claude3-Opus; A.4.2.4 Reka-Core; A.4.2.5 GPT-4o; A.4.3 Open-source performance on mini test set | - |
| dc.language.iso | en | - |
| dc.subject | 基準測試集 | zh_TW |
| dc.subject | 深度學習 | zh_TW |
| dc.subject | 多模態大型語言模型 | zh_TW |
| dc.subject | 跨時間推理 | zh_TW |
| dc.subject | 影片理解 | zh_TW |
| dc.subject | Deep Learning | en |
| dc.subject | Video Understanding | en |
| dc.subject | Reasoning Across Time | en |
| dc.subject | Multimodal Large Language Model | en |
| dc.subject | Benchmark | en |
| dc.title | 影片事件之間跨時間理解的基準測試集 | zh_TW |
| dc.title | REXTIME: A Benchmark Suite for Reasoning-Across-Time in Videos | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-2 | - |
| dc.description.degree | Master's | - |
| dc.contributor.oralexamcommittee | 陳祝嵩;楊福恩 | zh_TW |
| dc.contributor.oralexamcommittee | Chu-Song Chen;Fu-En Yang | en |
| dc.subject.keyword | 影片理解,跨時間推理,多模態大型語言模型,基準測試集,深度學習 | zh_TW |
| dc.subject.keyword | Video Understanding, Reasoning Across Time, Multimodal Large Language Model, Benchmark, Deep Learning | en |
| dc.relation.page | 59 | - |
| dc.identifier.doi | 10.6342/NTU202500991 | - |
| dc.rights.note | Authorization granted (open access worldwide) | - |
| dc.date.accepted | 2025-06-05 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 電信工程學研究所 | - |
| dc.date.embargo-lift | 2025-06-19 | - |
Appears in Collections: 電信工程學研究所
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-113-2.pdf | 11.27 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated by their specific copyright terms.
