Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88980
Title: 利用模態獨立模型於弱監督式音頻-影像事件分析
Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser
Authors: 賴永玄
Yung-Hsuan Lai
Advisor: 王鈺強
Yu-Chiang Frank Wang
Keyword: 音頻-影像學習, 跨模態學習, 深度學習
Audio-Visual Learning, Cross-Modality Learning, Deep Learning
Publication Year: 2023
Degree: Master's
Abstract: Audio-visual cross-modal learning is an important research area within multi-modal machine learning. Work in this area has mostly focused on the "modality-aligned" setting, which assumes that the audio and visual modalities simultaneously signal the prediction target. In practice, however, the "unaligned" setting is far more common: the target signal may be observable from only a single modality, audio or visual. To study the unaligned scenario in depth, we investigate the audio-visual video parsing task, whose goal is to recognize all audio and visual events in a video (unaligned) and to predict when each event occurs, under weakly-supervised learning: only video-level weak labels are available at training time. Such labels reveal which events happen in a video, but neither through which modality (audio, visual, or both) they are perceived nor when they occur. To address this challenge, we propose a simple, effective, and generic method, Visual-Audio Label Elaboration (VALOR). We introduce large-scale contrastively pre-trained models as modality teachers for the audio and visual modalities separately, harvesting pseudo labels that carry both modality and temporal information, and then train the model on these pseudo labels. Experiments show that VALOR improves the model's average F-score over the baseline by 8.0. Interestingly, we find that pseudo labels produced by modality-independent teachers outperform those from modality-fused teachers, because the former better resist interference from the potentially unaligned other modality. Moreover, our best model significantly surpasses the previous state of the art on all metrics, improving the average F-score by 5.4. Finally, we generalize VALOR to the Audio-Visual Event Localization task, where it likewise achieves new state-of-the-art results, demonstrating strong generality.
Audio-visual learning has been a major pillar of multi-modal machine learning, where the community has mostly focused on its modality-aligned setting, i.e., the audio and visual modalities are both assumed to signal the prediction target. With the Look, Listen, and Parse (LLP) dataset, we investigate the under-explored unaligned setting, where the goal is to recognize audio and visual events in a video with only weak labels observed. Such weak video-level labels only tell what events happen, without indicating the modality through which they are perceived (audio, visual, or both). To enhance learning in this challenging setting, we incorporate large-scale contrastively pre-trained models as the modality teachers. A simple, effective, and generic method, termed Visual-Audio Label Elaboration (VALOR), is proposed to harvest modality labels for the training events. Empirical studies show that the harvested labels significantly improve an attentional baseline by 8.0 in average F-score (Type@AV). Surprisingly, we find that modality-independent teachers outperform their modality-fused counterparts, since they are robust to noise from the other, potentially unaligned modality. Moreover, our best model achieves the new state of the art on all metrics of LLP by a substantial margin (+5.4 F-score for Type@AV). VALOR further generalizes to Audio-Visual Event Localization, where it also achieves a new state of the art.
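The pseudo-label harvesting idea described in the abstract can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the thesis's implementation: the function name, the fixed cosine-similarity threshold, and the embedding shapes are all hypothetical; the per-segment and per-class embeddings are assumed to come from a modality-independent teacher (e.g., a contrastively pre-trained audio or image model).

```python
import numpy as np

def harvest_pseudo_labels(seg_emb, class_emb, weak_labels, threshold=0.5):
    """Harvest segment-level pseudo labels for ONE modality.

    seg_emb:     (T, D) per-segment embeddings from a modality teacher
    class_emb:   (C, D) class text embeddings from the same teacher
    weak_labels: (C,)   binary video-level weak label vector
    Returns a (T, C) binary matrix: a class is active in a segment only
    if it appears in the video-level weak labels AND the segment's cosine
    similarity to that class exceeds the (assumed) threshold.
    """
    # L2-normalize so the dot product below is cosine similarity.
    seg = seg_emb / np.linalg.norm(seg_emb, axis=1, keepdims=True)
    cls = class_emb / np.linalg.norm(class_emb, axis=1, keepdims=True)
    sim = seg @ cls.T                        # (T, C) similarity scores
    pseudo = (sim > threshold).astype(int)   # temporal refinement per segment
    return pseudo * weak_labels              # drop classes absent from weak labels
```

Running one teacher per modality, rather than a fused audio-visual teacher, means each modality's pseudo labels cannot be contaminated by an event that is only present in the other, unaligned modality, which is the intuition the abstract gives for modality-independent teachers performing better.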
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88980
DOI: 10.6342/NTU202301932
Fulltext Rights: Authorized (open access worldwide)
Appears in Collections: Graduate Institute of Communication Engineering (電信工程學研究所)

Files in This Item:
File           Size      Format
ntu-111-2.pdf  13.82 MB  Adobe PDF