Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/86518
Title: | Learning by Decoupling Spatial and Temporal Relations for Video Question Answering |
Author: | Hsin-Ying Lee (李信穎) |
Advisor: | Winston H. Hsu (徐宏民) |
Keywords: | Machine Learning, Deep Learning, Video Understanding, Spatial-Temporal Modeling, Video Question Answering |
Publication Year: | 2022 |
Degree: | Master's |
Abstract: | While recent large-scale video-language pre-training has made great progress in video question answering, the design of spatial modeling is less fine-grained than that of image-language models, and existing practices of temporal modeling suffer from weak and noisy alignment between modalities. To learn fine-grained visual understanding, we decouple spatial-temporal modeling and propose a hybrid pipeline integrating an image-language and a video-language encoder. The former encodes spatial semantics from larger but sparsely sampled frames independently of time, while the latter models temporal dynamics at lower spatial but higher temporal resolution. To help the video-language model learn temporal relations for video QA, we propose a novel pre-training objective, Temporal Referring Modeling, which requires the model to identify the temporal positions of events in video sequences. Extensive and detailed experiments demonstrate that our model outperforms previous work pre-trained on datasets orders of magnitude larger. |
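The abstract describes the decoupled design only at a high level. Below is a minimal, hypothetical PyTorch sketch of the data flow it implies: a spatial branch that encodes a few high-resolution frames independently of time, and a temporal branch that models many low-resolution frames as a sequence, fused with a question embedding. All names here (`DecoupledVideoQA`, `dim`, the toy encoders) are illustrative assumptions; the actual thesis uses pretrained image- and video-language encoders, not these stand-ins.

```python
import torch
import torch.nn as nn

class DecoupledVideoQA(nn.Module):
    """Illustrative sketch only: tiny stand-in modules showing the
    decoupled spatial/temporal data flow described in the abstract."""

    def __init__(self, dim=256, num_answers=100, q_dim=300):
        super().__init__()
        # Spatial branch: encodes each sparse, high-resolution frame
        # independently of time (stand-in for the image-language encoder).
        self.spatial = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=16, stride=16),  # coarse patch embedding
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # Temporal branch: models dynamics over dense, low-resolution frames
        # (stand-in for the video-language encoder).
        self.frame_proj = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=8, stride=8),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.temporal = nn.GRU(dim, dim, batch_first=True)
        self.question_proj = nn.Linear(q_dim, dim)
        self.classifier = nn.Linear(3 * dim, num_answers)

    def forward(self, sparse_frames, dense_frames, question_emb):
        # sparse_frames: (B, T_s, 3, 224, 224) -- few frames, high resolution
        # dense_frames:  (B, T_d, 3, 112, 112) -- many frames, low resolution
        B, T_s = sparse_frames.shape[:2]
        T_d = dense_frames.shape[1]
        spatial = self.spatial(sparse_frames.flatten(0, 1)).view(B, T_s, -1).mean(1)
        frames = self.frame_proj(dense_frames.flatten(0, 1)).view(B, T_d, -1)
        _, h = self.temporal(frames)  # final hidden state: (1, B, dim)
        fused = torch.cat([spatial, h[-1], self.question_proj(question_emb)], dim=-1)
        return self.classifier(fused)  # (B, num_answers) answer logits

# Toy usage: 4 high-res frames for the spatial branch, 32 low-res for the temporal one.
model = DecoupledVideoQA()
logits = model(torch.randn(2, 4, 3, 224, 224),
               torch.randn(2, 32, 3, 112, 112),
               torch.randn(2, 300))
```

The Temporal Referring Modeling objective is likewise described only in one sentence; one hedged reading is a classification over an event's temporal position in the video. A toy sketch under that assumption, with a simple dot-product scorer that is not claimed to match the thesis's formulation:

```python
import torch
import torch.nn.functional as F

def temporal_referring_loss(clip_feats, event_feat, true_pos):
    """clip_feats: (B, T, D) per-segment video features;
    event_feat: (B, D) feature of the queried event;
    true_pos: (B,) index of the segment where the event occurs."""
    # Score each temporal position by similarity to the event query,
    # then train the model to pick the correct position.
    scores = torch.einsum('btd,bd->bt', clip_feats, event_feat)  # (B, T)
    return F.cross_entropy(scores, true_pos)

loss = temporal_referring_loss(torch.randn(2, 8, 256), torch.randn(2, 256),
                               torch.tensor([3, 5]))
```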
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/86518 |
DOI: | 10.6342/NTU202202028 |
Full-Text Authorization: | Authorized (open access worldwide) |
Electronic Full-Text Release Date: | 2022-08-18 |
Appears in Collections: | Department of Computer Science and Information Engineering |
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-111-1.pdf | 7.36 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.