Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88100

| Title: | Improving Video Understanding through Reliable Question-Relevant Frame Localization and Spatial Guidance |
|---|---|
| Author: | Bing-Chen Tsai (蔡秉辰) |
| Advisor: | Winston Hsu (徐宏民) |
| Keywords: | video question answering, visual question answering, knowledge distillation, vision and language, multimodal learning, machine learning |
| Publication Year: | 2023 |
| Degree: | Master's |
| Abstract: | Video Question Answering (Video QA) is a challenging task that requires models to accurately identify and contextualize relevant information within abundant video content. Conventional approaches attempt to emphasize related information in specific frames by considering the visual-question relationship. However, the absence of ground-truth causal frames means such a relationship can only be learned implicitly, leading to the "misfocus" issue. To address this, we propose a novel training pipeline called "Spatial distillation And Reliable Causal frame localization", which leverages an off-the-shelf image QA model to help the video QA model better grasp relevant information in the temporal and spatial dimensions of the video. Specifically, we use the visual-question and answer priors from an image QA model to obtain pseudo ground-truth causal frames and explicitly guide the video QA model in the temporal dimension. Moreover, given the superior spatial reasoning ability of image models, we transfer this knowledge to video models via knowledge distillation. Our model-agnostic approach outperforms previous methods on various benchmarks. Besides, it consistently improves performance (by up to 5%) across several video QA models, both pre-trained and non-pre-trained. |
| URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88100 |
| DOI: | 10.6342/NTU202300917 |
| Full-text License: | Authorized (worldwide public access) |
| Electronic Full-text Release Date: | 2028-07-11 |
| Appears in Collections: | Department of Computer Science and Information Engineering |
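The two ingredients described in the abstract, temporal supervision from pseudo-labeled causal frames and spatial knowledge distillation from an image QA teacher, can be sketched roughly as follows. This is a minimal illustration under assumptions, not the thesis implementation: the function names, the confidence threshold, and the temperature value are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-softened probability distribution over logits."""
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def causal_frame_pseudo_labels(frame_confidences, threshold=0.5):
    """Mark frames whose image-QA answer confidence clears the threshold
    as pseudo ground-truth causal frames (1) or irrelevant frames (0).
    The threshold is a hypothetical choice for this sketch."""
    return [1 if c >= threshold else 0 for c in frame_confidences]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence KL(teacher || student) between temperature-softened
    distributions, the standard knowledge-distillation objective."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

In a full pipeline these terms would presumably be weighted and added to the ordinary answer-classification loss: the pseudo-labels would supervise a per-frame relevance head in the temporal dimension, while the distillation term would pull the video model's spatial representations toward the image teacher's.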
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-111-2.pdf (publicly available after 2028-07-11) | 5.96 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
