NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97407
Full metadata record
DC Field    Value    Language
dc.contributor.advisor    王鈺強    zh_TW
dc.contributor.advisor    Yu-Chiang Frank Wang    en
dc.contributor.author    鄭惟元    zh_TW
dc.contributor.author    Wei-Yuan Cheng    en
dc.date.accessioned    2025-06-05T16:08:12Z    -
dc.date.available    2025-06-06    -
dc.date.copyright    2025-06-05    -
dc.date.issued    2025    -
dc.date.submitted    2025-05-27    -
dc.identifier.uri    http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97407    -
dc.description.abstract    Dense video captioning aims to parse and describe, in temporal order, all of the events in a video. In recent years, many novel and effective methods have leveraged large language models (LLMs) to provide more detailed descriptions of video segments. However, existing video large language models (VideoLLMs) still cannot precisely identify event boundaries in untrimmed videos, so the generated descriptions do not correspond well to the actual events.

To address this challenge, this thesis proposes the TA-Prompting framework, which improves VideoLLMs by introducing Temporal Anchors. These temporal anchors learn to localize video events precisely and guide the VideoLLM toward temporally aware video event understanding. At inference time, because each video contains a different number of events and the output caption sequence must be determined appropriately, we introduce an event coherent sampling strategy. This strategy handles an arbitrary number of events and selects those that are sufficiently coherent in time and whose textual descriptions accurately match the visual content of the corresponding video segments.

Extensive experiments on multiple benchmark datasets show that, compared with current state-of-the-art VideoLLMs, TA-Prompting achieves superior performance on dense video captioning and related temporal understanding tasks such as moment retrieval and temporal question answering.
zh_TW
dc.description.abstract    Dense video captioning aims to interpret and describe all temporally localized events throughout an input video. Recent state-of-the-art methods leverage large language models (LLMs) to provide detailed moment descriptions for video data. However, existing VideoLLMs still struggle to identify precise event boundaries in untrimmed videos, so the generated captions are not properly grounded. In this paper, we propose TA-Prompting, which enhances VideoLLMs via Temporal Anchors that learn to precisely localize events and prompt the VideoLLMs to perform temporal-aware video event understanding. During inference, in order to properly determine the output caption sequence from the arbitrary number of events present within a video, we introduce an event coherent sampling strategy to select event captions with sufficient coherence across temporal events and cross-modal similarity with the given video. Through extensive experiments on benchmark datasets, we show that TA-Prompting compares favorably against state-of-the-art VideoLLMs, yielding superior performance on dense video captioning and temporal understanding tasks including moment retrieval and temporal QA.    en
dc.description.provenance    Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-06-05T16:08:12Z. No. of bitstreams: 0    en
dc.description.provenance    Made available in DSpace on 2025-06-05T16:08:12Z (GMT). No. of bitstreams: 0    en
dc.description.tableofcontents    Verification Letter from the Oral Examination Committee i
Acknowledgements ii
摘要 iii
Abstract iv
Contents vi
List of Figures viii
List of Tables x
Chapter 1 Introduction 1
Chapter 2 Related Works 5
2.1 Video Large Language Model 5
2.2 Event-Based Captioning and Understanding 6
Chapter 3 Method 7
3.1 Problem Formulation 7
3.2 Temporal-Anchored VideoLLMs 7
3.2.1 Learning to Predict Temporal Anchors 7
3.2.2 Temporal-Aware Video Event Captioning 10
3.3 Inference-time Event Coherent Sampling 11
Chapter 4 Experiment 13
4.1 Dataset and Experimental Setup 13
4.2 Evaluation Metrics 14
4.3 Implementation Details 15
4.4 Quantitative Evaluation 17
4.5 Qualitative Evaluation 19
4.6 Ablation Study 20
Chapter 5 Conclusion 22
References 23
Appendix A — Advanced details and experiment 33
A.1 Detailed Evaluation Explanation 33
A.2 Additional Training Details 34
A.2.1 Temporal-Aware Video Event Captioning learning template 34
A.2.2 TA-Prompting’s pre-training process 34
A.3 The effectiveness of Temporal-Aware Video Event Captioning training 35
A.4 Details of Event Coherent Sampling 36
A.5 More Quantitative Results 38
Appendix B — Review and Defense Discussion 42
-
dc.language.iso    en    -
dc.subject    時間感知理解    zh_TW
dc.subject    深度學習    zh_TW
dc.subject    事件片段檢索    zh_TW
dc.subject    影片大型語言模型    zh_TW
dc.subject    影片密集事件描述    zh_TW
dc.subject    Deep Learning    en
dc.subject    Dense Video Captioning    en
dc.subject    Video Large Language Models    en
dc.subject    Moment Retrieval    en
dc.subject    Temporal-aware Understanding    en
dc.title    結合時間錨點強化影片大型語言模型於密集事件生成之應用    zh_TW
dc.title    TA-Prompting: Enhancing Video Large Language Models for Dense Video Captioning via Temporal Anchors    en
dc.type    Thesis    -
dc.date.schoolyear    113-2    -
dc.description.degree    碩士 (Master's)    -
dc.contributor.oralexamcommittee    陳祝嵩;楊福恩    zh_TW
dc.contributor.oralexamcommittee    Chu-Song Chen;Fu-En Yang    en
dc.subject.keyword    影片密集事件描述,影片大型語言模型,事件片段檢索,時間感知理解,深度學習    zh_TW
dc.subject.keyword    Dense Video Captioning,Video Large Language Models,Moment Retrieval,Temporal-aware Understanding,Deep Learning    en
dc.relation.page    46    -
dc.identifier.doi    10.6342/NTU202500983    -
dc.rights.note    同意授權(限校園內公開) (authorized; access restricted to campus)    -
dc.date.accepted    2025-05-28    -
dc.contributor.author-college    電機資訊學院 (College of Electrical Engineering and Computer Science)    -
dc.contributor.author-dept    電信工程學研究所 (Graduate Institute of Communication Engineering)    -
dc.date.embargo-lift    2025-06-06    -
Appears in Collections: 電信工程學研究所 (Graduate Institute of Communication Engineering)

Files in This Item:
File    Size    Format
ntu-113-2.pdf    27.08 MB    Adobe PDF
Access is restricted to NTU campus IP addresses (use the VPN service for off-campus access).


Except where otherwise noted, all items in this repository are protected by copyright, with all rights reserved.
