Please use this identifier to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/7267
Title: | Cross-Modality Co-Attention for Audio-Visual Event Localization (跨模態共注意視聽事件定位) |
Authors: | Yan-Bo Lin 林彥伯 |
Advisor: | Yu-Chiang Frank Wang 王鈺強 |
Keyword: | Audio-Video Features, Dual Modality, Cross Modality, Event Localization, Deep Learning, Machine Learning, Computer Vision |
Publication Year : | 2019 |
Degree: | Master's |
Abstract: | Audio-visual event localization requires one to identify the event label across video frames by jointly observing visual and audio information. To address this task, we propose a deep neural network named the Audio-Visual Sequence-to-sequence Dual Network (AVSDN). By jointly taking both audio and visual features at each time segment as inputs, our proposed model learns global and local event information in a sequence-to-sequence manner. In addition, we propose a deep learning framework of cross-modality co-attention for audio-visual event localization; this co-attention framework can be applied to existing methods as well as to AVSDN. Our co-attention model is able to exploit intra-frame and inter-frame visual information, with audio features jointly observed to perform co-attention over these three sources of information. With visual, temporal, and audio information observed across consecutive video frames, our model achieves promising capability in extracting informative spatial/temporal features for improved event localization. Moreover, our model is able to produce instance-level attention, which identifies the image regions most likely associated with the sound/event of interest and distinguishes the actual sounding object even in scenes containing multiple instances of the same object class. Experiments on a benchmark dataset confirm the effectiveness of our proposed framework, with ablation studies performed to verify the design of our proposed network model. |
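The abstract describes audio features guiding spatial attention over visual regions so that the model can pick out the sounding object. The following is a minimal, self-contained sketch of that audio-guided spatial-attention idea; all function names, projection matrices, and dimensions here are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def audio_guided_visual_attention(V, a, W_v, W_a):
    """Attend over spatial visual regions of one time segment, guided by audio.

    V:   (R, d_v) visual features for R spatial regions (e.g. a 7x7 CNN map).
    a:   (d_a,)   audio feature for the same segment.
    W_v: (d_v, d) projection of visual features into a shared space.
    W_a: (d, d_a) projection of the audio feature into the same space.
    """
    scores = (V @ W_v) @ (W_a @ a)   # (R,) region-audio compatibility
    alpha = softmax(scores)          # attention weights over regions, sums to 1
    v_att = alpha @ V                # (d_v,) audio-attended visual feature
    return v_att, alpha

# Illustrative usage with random features.
rng = np.random.default_rng(0)
R, d_v, d_a, d = 49, 512, 128, 64
V = rng.standard_normal((R, d_v))
a = rng.standard_normal(d_a)
W_v = rng.standard_normal((d_v, d)) * 0.05
W_a = rng.standard_normal((d, d_a)) * 0.05
v_att, alpha = audio_guided_visual_attention(V, a, W_v, W_a)
```

The attention map `alpha` is what yields the instance-level localization mentioned in the abstract: visualizing it over the frame highlights the regions most compatible with the current audio.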
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/7267 |
DOI: | 10.6342/NTU201902224 |
Fulltext Rights: | Authorized (open access worldwide) |
Appears in Collections: | Graduate Institute of Communication Engineering |
Files in This Item:
File | Size | Format | Access
---|---|---|---
ntu-108-1.pdf | 3.34 MB | Adobe PDF | View/Open
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.