  1. NTU Theses and Dissertations Repository
  2. College of Electrical Engineering and Computer Science
  3. Graduate Institute of Communication Engineering
Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/7267
Title: Cross-Modality Co-Attention for Audio-Visual Event Localization (跨模態共注意視聽事件定位)
Authors: Yan-Bo Lin (林彥伯)
Advisor: 王鈺強
Keyword: Audio-Video Features, Dual Modality, Cross Modality, Event Localization, Deep Learning, Machine Learning, Computer Vision
Publication Year: 2019
Degree: Master's
Abstract: Audio-visual event localization requires jointly observing cross-modal audio-visual information and the event labels across video frames. To address this task, we propose a cross-modality deep learning framework centered on co-attention for video event localization. Our model exploits intra- and inter-frame temporal and visual information together with the concurrent audio information, and performs co-attention over these three sources to attend to visual objects. By jointly observing visual, temporal, and audio information across consecutive frames, our model gains a novel ability to extract spatial/temporal features that improve audio-visual event localization. Moreover, our model produces instance-level visual attention, identifying the image regions most likely to be emitting sound and, in scenes containing multiple instances of the same object, locating the one that is actually making the sound. In our experiments, we compare the proposed co-attention module against state-of-the-art methods on a public dataset to validate its effectiveness; our results surpass the accuracy of existing methods, and the visualizations confirm that the proposed architecture achieves instance-level visual attention.
Audio-visual event localization requires one to identify the event label across video frames by jointly observing visual and audio information. To address this task, we propose a deep neural network named Audio-Visual Sequence-to-sequence Dual Network (AVSDN). By jointly taking both audio and visual features at each time segment as inputs, our proposed model learns global and local event information in a sequence-to-sequence manner. In addition, we propose a deep learning framework of cross-modality co-attention for audio-visual event localization; this co-attention framework can be applied to existing methods as well as AVSDN. Our co-attention model is able to exploit intra- and inter-frame visual information, with audio features jointly observed to perform co-attention over the above three modalities. With visual, temporal, and audio information observed across consecutive video frames, our model achieves promising capability in extracting informative spatial/temporal features for improved event localization. Moreover, our model is able to produce instance-level attention, which identifies the image regions associated with the sound/event of interest. Experiments on a benchmark dataset confirm the effectiveness of our proposed framework, with ablation studies performed to verify the design of our proposed network model.
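The abstracts describe co-attention in which audio features guide attention over visual regions to produce instance-level attention maps. As a rough illustrative sketch only (not the thesis's actual architecture; the feature dimensions and the use of plain scaled dot-product attention are assumptions), audio-guided attention over visual region features can be written as:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(audio_feat, visual_feats):
    """Attend over visual regions using the audio feature as the query.

    audio_feat:   (d,)   audio embedding for one time segment (hypothetical)
    visual_feats: (n, d) embeddings of n spatial regions in the frame
    Returns the attended visual vector (d,) and the attention weights (n,).
    """
    d_k = audio_feat.shape[-1]
    scores = visual_feats @ audio_feat / np.sqrt(d_k)  # (n,) region relevance
    weights = softmax(scores)                          # instance-level attention map
    attended = weights @ visual_feats                  # (d,) sound-aware visual feature
    return attended, weights

# Toy example: 4 spatial regions, 8-dimensional features.
rng = np.random.default_rng(0)
audio = rng.normal(size=8)
visual = rng.normal(size=(4, 8))
attended, weights = co_attention(audio, visual)
```

The attention weights form a distribution over regions, which is what lets such a model highlight the region "most likely emitting sound" at the instance level, as the abstracts claim.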
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/7267
DOI: 10.6342/NTU201902224
Fulltext Rights: Authorized (open access worldwide)
Appears in Collections: Graduate Institute of Communication Engineering

Files in This Item:
File: ntu-108-1.pdf (3.34 MB, Adobe PDF)


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
