Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97407
Title: TA-Prompting: Enhancing Video Large Language Models for Dense Video Captioning via Temporal Anchors (結合時間錨點強化影片大型語言模型於密集事件生成之應用)
Authors: Wei-Yuan Cheng (鄭惟元)
Advisor: Yu-Chiang Frank Wang (王鈺強)
Keywords: Dense Video Captioning, Video Large Language Models, Moment Retrieval, Temporal-aware Understanding, Deep Learning
Publication Year: 2025
Degree: Master's
Abstract: Dense video captioning aims to interpret and describe, in temporal order, all localized events throughout an input video. Recent state-of-the-art methods leverage large language models (LLMs) to provide detailed moment descriptions for video data. However, existing video large language models (VideoLLMs) still struggle to identify precise event boundaries in untrimmed videos, so the generated captions are often not properly grounded in the actual events.

To address this challenge, we propose TA-Prompting, a framework that enhances VideoLLMs via Temporal Anchors. These anchors learn to precisely localize events in a video and prompt the VideoLLM to perform temporally aware video event understanding. During inference, since the number of events varies from video to video and the output caption sequence must be determined accordingly, we introduce an event coherent sampling strategy: it handles an arbitrary number of events and selects event captions that exhibit both sufficient coherence across temporal events and cross-modal similarity with the given video.

Through extensive experiments on benchmark datasets, we show that TA-Prompting compares favorably against state-of-the-art VideoLLMs, yielding superior performance on dense video captioning and related temporal understanding tasks, including moment retrieval and temporal QA.
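The event coherent sampling step described above can be illustrated with a toy sketch. The thesis's implementation is not reproduced on this record page, so everything below is an assumption for illustration: the `Event` fields, the thresholds, and the token-overlap proxy for inter-caption coherence (the actual method presumably uses learned embeddings for both the coherence and cross-modal similarity scores).

```python
from dataclasses import dataclass

@dataclass
class Event:
    start: float        # event start time in seconds (hypothetical field)
    end: float          # event end time in seconds (hypothetical field)
    caption: str        # generated caption for this moment
    similarity: float   # assumed caption-video cross-modal similarity in [0, 1]

def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between caption token sets -- a crude stand-in
    for a learned coherence score between consecutive event captions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def coherent_sample(events, sim_thresh=0.5, coh_thresh=0.1):
    """Greedily keep a temporally ordered caption sequence in which each
    caption is grounded in the video and coherent with its predecessor."""
    kept = []
    for ev in sorted(events, key=lambda e: e.start):
        if ev.similarity < sim_thresh:
            continue  # drop captions not grounded enough in the video
        if kept and token_overlap(kept[-1].caption, ev.caption) < coh_thresh:
            continue  # drop captions that break coherence with the last kept event
        kept.append(ev)
    return kept
```

Note that this greedy pass handles an arbitrary number of candidate events, mirroring the abstract's motivation that the event count varies per video; the threshold values here are placeholders.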
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97407
DOI: 10.6342/NTU202500983
Fulltext Rights: Authorized (campus access only)
Embargo lifted: 2025-06-06
Appears in Collections: Graduate Institute of Communication Engineering

Files in This Item:
ntu-113-2.pdf — 27.08 MB, Adobe PDF (access limited to NTU IP range)


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
