Please use this identifier to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/15465
Title: | On the Improvement of Singing Voice Separation Using U-Net (改良U-Net對歌曲人聲分離效果) |
Authors: | Hsiang-Yu Huang (黃翔宇) |
Advisor: | Jyh-Shing Roger Jang (張智星) |
Keyword: | singing voice separation, U-Net, post-processing, spectral subtraction |
Publication Year : | 2020 |
Degree: | Master's |
Abstract: | Nowadays, deep learning has become the mainstream approach to singing voice separation. This thesis focuses on U-Net, the most classic deep learning architecture for singing voice separation, and is divided into three parts. The first part compares the singing voice separation performance of the U-Net architecture originally proposed by Ronneberger with that of the U-Net architecture proposed by Jansson. The second part proposes a new U-Net model that combines characteristics of the two aforementioned models, to see whether it can improve the separation of vocals from background music. The third part explores whether spectral subtraction can be applied as a post-processing step to further improve singing voice separation performance. The datasets used in this research include iKala, DSD100, MedleyDB, and MUSDB18; in addition, we obtained 900 multi-track songs from an external music studio as training data. For performance evaluation, we used the Source-to-Distortion Ratio (SDR), Source-to-Interferences Ratio (SIR), and Sources-to-Artifacts Ratio (SAR) proposed by Vincent to assess each model's separation results. Finally, we compared the singing voice separation results of the proposed model architecture and post-processing method with those of the latest publicly available music source separation tools, Spleeter and Demucs, and found that our approach compares favorably with both in terms of the above indicators. |
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/15465 |
DOI: | 10.6342/NTU202001653 |
Fulltext Rights: | Not authorized |
Appears in Collections: | Graduate Institute of Networking and Multimedia |
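The two techniques named in the abstract can be sketched in a few lines. The following is a minimal, self-contained illustration of generic magnitude-domain spectral subtraction and of the BSS-Eval-style SDR definition, not the thesis's actual implementation; the `alpha` over-subtraction factor and `floor` parameter are assumptions chosen for illustration.

```python
import math

def spectral_subtraction(mixture_mag, accomp_mag, alpha=1.0, floor=0.05):
    """Estimate vocal magnitudes by subtracting the estimated accompaniment
    from the mixture spectrum, flooring each bin so magnitudes stay positive."""
    return [max(m - alpha * a, floor * m)
            for m, a in zip(mixture_mag, accomp_mag)]

def sdr(reference, estimate):
    """Source-to-Distortion Ratio in dB (BSS-Eval style): project the
    estimate onto the reference to obtain the target component, then
    compare target energy with residual (distortion) energy."""
    scale = (sum(e * r for e, r in zip(estimate, reference))
             / sum(r * r for r in reference))
    target = [scale * r for r in reference]          # part explained by the source
    error = [e - t for e, t in zip(estimate, target)]  # everything else
    target_energy = sum(t * t for t in target)
    error_energy = sum(x * x for x in error)
    return 10.0 * math.log10(target_energy / error_energy)
```

In practice such post-processing operates frame by frame on STFT magnitudes, and evaluation uses the full BSS-Eval decomposition (e.g. via `mir_eval.separation`), which additionally yields SIR and SAR.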
Files in This Item:
File | Size | Format
---|---|---
U0001-2007202015240300.pdf (Restricted Access) | 1.98 MB | Adobe PDF
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.