Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/1199

| Title: | Weakly-supervised Event Detection for Music Audios and Videos Using Fully-convolutional Networks |
| Author: | Jen-Yu Liu (劉任瑜) |
| Advisor: | Shyh-Kang Jeng (鄭士康) |
| Co-Advisor: | Yi-Hsuan Yang (楊奕軒) |
| Keywords: | music event detection, instrument-playing action detection, weakly-supervised learning, music auto-tagging |
| Publication Year: | 2018 |
| Degree: | Doctoral |
| Abstract: | With the growth of audio and video streaming services, music audio and video are among the most popular sources of entertainment today. Music and music performance carry rich information. To analyze such audio and video automatically for retrieval or pedagogical purposes, we would like to use machine learning to detect audio and visual events. However, learning-based methods usually require a large amount of training data, and annotating audio and video is not easy because the process is time-consuming and tedious. In this work, we show how to train detection models with only clip-level annotations through weakly-supervised learning, using fully-convolutional networks (FCNs) for event detection in music audio and video. First, we develop FCNs for temporally detecting music audio events such as genres, instruments, and moods, and evaluate them on an instrument dataset. Second, we develop a weakly-supervised framework for detecting instrument-playing actions in videos. This framework involves two auxiliary models, a sound model and an object model, both trained with clip-level annotations only; they provide temporal and spatial supervision for the action model. In total, 5,400 manually annotated frames are used to evaluate the proposed framework, which substantially improves performance both temporally and spatially. (A code sketch of the weak-supervision idea follows this record.) |
| URI: | http://tdr.lib.ntu.edu.tw/handle/123456789/1199 |
| DOI: | 10.6342/NTU201801365 |
| Full-text License: | Authorized (open access worldwide) |
| Appears in Collections: | Department of Electrical Engineering |
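
The following is a minimal sketch of the core weak-supervision idea described in the abstract: a fully-convolutional network produces a score per tag per frame, and pooling over time reduces those frame scores to a clip score, so the model can be trained with clip-level labels alone while its frame-level outputs localize events at test time. This is a PyTorch sketch with illustrative layer sizes and max pooling as the temporal aggregator; it is an assumed reconstruction, not the thesis's actual implementation.

```python
# A minimal sketch (not the thesis's actual code) of weakly-supervised
# temporal event detection with a fully-convolutional network (FCN):
# 1-D convolutions produce a frame-level score per tag; pooling over time
# yields a clip-level score so training needs clip-level labels only.
import torch
import torch.nn as nn

class WeaklySupervisedFCN(nn.Module):
    def __init__(self, n_mels=128, n_tags=50):
        super().__init__()
        # Convolve along time; the mel bins act as input channels.
        self.features = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=5, padding=2),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=5, padding=2),
            nn.BatchNorm1d(256),
            nn.ReLU(),
        )
        # One score per tag per frame; no fully-connected layer, so the
        # network accepts clips of any length.
        self.frame_scores = nn.Conv1d(256, n_tags, kernel_size=1)

    def forward(self, mel):                      # mel: (batch, n_mels, frames)
        frame_logits = self.frame_scores(self.features(mel))
        # Max-pool over time: a tag is "on" for the clip if it is on
        # in at least one frame. This is the weak-supervision bridge.
        clip_logits = frame_logits.max(dim=2).values
        return frame_logits, clip_logits

model = WeaklySupervisedFCN()
mel = torch.randn(8, 128, 430)                   # ~10 s of 128-bin mel frames
frame_logits, clip_logits = model(mel)
# Train with clip-level labels only; frame_logits localize events at test time.
loss = nn.BCEWithLogitsLoss()(clip_logits, torch.randint(0, 2, (8, 50)).float())
```

Because there is no fully-connected layer, the same network accepts clips of arbitrary length, and the frame-level logits that feed the pooling step are exactly what provide temporal localization without any frame-level labels.
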
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-107-1.pdf | 17.68 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
