NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/15465
Full metadata record
dc.contributor.advisor: 張智星 (Jyh-Shing Roger Jang)
dc.contributor.author (en): Hsiang-Yu Huang
dc.contributor.author (zh_TW): 黃翔宇
dc.date.accessioned: 2021-06-07T17:40:51Z
dc.date.copyright: 2020-07-22
dc.date.issued: 2020
dc.date.submitted: 2020-07-20
dc.identifier.citation:
[1] P.-S. Huang, S. D. Chen, P. Smaragdis, and M. Hasegawa-Johnson, "Singing-voice separation from monaural recordings using robust principal component analysis," in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2012, pp. 57–60.
[2] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Singing-voice separation from monaural recordings using deep recurrent neural networks," in ISMIR, 2014, pp. 477–482.
[3] R. M. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and J. P. Bello, "MedleyDB: A multitrack dataset for annotation-intensive MIR research," in ISMIR, vol. 14, 2014, pp. 155–160.
[4] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[5] A. Jansson, E. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde, "Singing voice separation with deep U-Net convolutional networks," 2017.
[6] R. N. Bracewell, The Fourier Transform and Its Applications. New York: McGraw-Hill, 1986.
[7] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[8] L. Rabiner, M. Cheng, A. Rosenberg, and C. McGonegal, "A comparative performance study of several pitch detection algorithms," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 5, pp. 399–418, 1976.
[9] J. Salamon, E. Gómez, D. P. Ellis, and G. Richard, "Melody extraction from polyphonic music signals: Approaches, applications, and challenges," IEEE Signal Processing Magazine, vol. 31, no. 2, pp. 118–134, 2014.
[10] A. L. Berenzweig and D. P. Ellis, "Locating singing voice segments within music signals," in Proceedings of the 2001 IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics (Cat. No. 01TH8575). IEEE, 2001, pp. 119–122.
[11] A. Cichocki, R. Zdunek, and S.-i. Amari, "New algorithms for non-negative matrix factorization in applications to blind source separation," in 2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings, vol. 5. IEEE, 2006, pp. V–V.
[12] Z. Rafii and B. Pardo, "Repeating pattern extraction technique (REPET): A simple method for music/voice separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 1, pp. 73–84, 2012.
[13] I. T. Jolliffe, "Principal components in regression analysis," in Principal Component Analysis. Springer, 1986, pp. 129–155.
[14] S. B. Kotsiantis, I. Zaharakis, and P. Pintelas, "Supervised machine learning: A review of classification techniques," Emerging Artificial Intelligence Applications in Computer Engineering, vol. 160, pp. 3–24, 2007.
[15] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[16] J. Muth, S. Uhlich, N. Perraudin, T. Kemp, F. Cardinaux, and Y. Mitsufuji, "Improving DNN-based music source separation using phase features," arXiv preprint arXiv:1807.02710, 2018.
[17] D. Stoller, S. Ewert, and S. Dixon, "Adversarial semi-supervised audio source separation applied to singing voice extraction," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 2391–2395.
[18] S. R. Park and J. Lee, "A fully convolutional neural network for speech enhancement," arXiv preprint arXiv:1609.07132, 2016.
[19] D. Liu, P. Smaragdis, and M. Kim, "Experiments on deep learning for speech denoising," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[20] T.-S. Chan, T.-C. Yeh, Z.-C. Fan, H.-W. Chen, L. Su, Y.-H. Yang, and R. Jang, "Vocal activity informed singing voice separation with the iKala dataset," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 718–722.
[21] A. Liutkus, F.-R. Stöter, Z. Rafii, D. Kitamura, B. Rivet, N. Ito, N. Ono, and J. Fontecave, "The 2016 signal separation evaluation campaign," in Latent Variable Analysis and Signal Separation: 12th International Conference, LVA/ICA 2015, Liberec, Czech Republic, August 25–28, 2015, Proceedings, P. Tichavský, M. Babaie-Zadeh, O. J. Michel, and N. Thirion-Moreau, Eds. Cham: Springer International Publishing, 2017, pp. 323–332.
[22] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner, "The MUSDB18 corpus for music separation," Dec. 2017. [Online]. Available: https://doi.org/10.5281/zenodo.1117372
[23] S. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, no. 2, pp. 113–120, 1979.
[24] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
[25] R. Hennequin, A. Khlif, F. Voituret, and M. Moussallam, "Spleeter: A fast and state-of-the-art music source separation tool with pre-trained models," in Proc. International Society for Music Information Retrieval Conference, 2019.
[26] A. Défossez, N. Usunier, L. Bottou, and F. Bach, "Music source separation in the waveform domain," arXiv preprint arXiv:1911.13254, 2019.
[27] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
[28] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/15465
dc.description.abstract (zh_TW): Deep learning has become the mainstream approach to singing voice separation. This thesis focuses on U-Net, the most classic deep learning architecture for the task, and addresses three main questions. First, it compares the original U-Net architecture proposed by Ronneberger with the U-Net architecture proposed by Jansson in terms of vocal separation quality. Second, it combines the characteristics of these two U-Net models into a new U-Net architecture and investigates whether this improves the separation of vocals and accompaniment. Third, it examines the separated vocal output and studies whether spectral subtraction can serve as a post-processing step to further improve separation performance. The datasets used include iKala, DSD100, MedleyDB, and MUSDB18; in addition, 900 multi-track songs obtained through a collaboration with an external music studio were used as training data. The separation results of each model are evaluated with the Source-to-Distortion Ratio (SDR), Source-to-Interferences Ratio (SIR), and Sources-to-Artifacts Ratio (SAR) proposed by Vincent. Finally, the proposed architecture and post-processing are compared with the latest publicly available music source separation tools, Spleeter and Demucs; the results show that, overall, our approach separates vocals slightly better on these metrics.
dc.description.abstract (en): Nowadays, deep learning has become the mainstream method for singing voice separation. This study investigates U-Net, the most classic deep learning architecture for singing voice separation. The thesis is divided into three parts. The first compares the U-Net architectures proposed by Ronneberger and by Jansson, respectively. The second proposes a new U-Net model that combines characteristics of the two aforementioned models, to see whether the new model can improve the separation results. The third explores whether spectral subtraction can be used as a post-processing step to improve the performance of singing voice separation. The datasets used in this research include iKala, DSD100, MedleyDB, and MUSDB18; in addition, we acquired 900 multi-track songs for model training. For performance evaluation, we used the Source-to-Distortion Ratio (SDR), Source-to-Interferences Ratio (SIR), and Sources-to-Artifacts Ratio (SAR) proposed by Vincent. Finally, we compared our singing voice separation results with the latest publicly available music source separation tools, Spleeter and Demucs, and found that our model compares favorably with both in terms of the above indicators.
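As a rough illustration of the spectral-subtraction post-processing described in the abstract, here is a minimal sketch in Python. It assumes the accompaniment spectrum is estimated from the residual (mixture minus the estimated vocals); the STFT settings, the over-subtraction factor `alpha`, and the spectral floor are illustrative assumptions, not parameters taken from the thesis.

```python
# Hypothetical sketch of spectral subtraction as post-processing for a
# separated vocal track. Parameter values are illustrative assumptions.
import numpy as np
import librosa

def spectral_subtract_vocals(vocal_est, mixture, n_fft=2048, hop=512,
                             alpha=1.0, floor=0.01):
    """Subtract an accompaniment magnitude estimate from the vocal estimate."""
    V = librosa.stft(vocal_est, n_fft=n_fft, hop_length=hop)
    # Assumption: treat the residual (mixture minus estimated vocals) as the
    # "noise" whose magnitude spectrum is subtracted, in the spirit of
    # Boll-style spectral subtraction [23].
    residual = mixture[:len(vocal_est)] - vocal_est
    R = librosa.stft(residual, n_fft=n_fft, hop_length=hop)
    # Magnitude subtraction with a spectral floor to avoid negative values
    # and limit musical noise.
    mag = np.maximum(np.abs(V) - alpha * np.abs(R), floor * np.abs(V))
    # Reuse the phase of the vocal estimate for reconstruction.
    cleaned = mag * np.exp(1j * np.angle(V))
    return librosa.istft(cleaned, hop_length=hop, length=len(vocal_est))
```

Flooring the subtracted magnitude at a small fraction of the original spectrum is the standard guard against negative magnitudes in Boll-style spectral subtraction [23].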
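For context, the SDR, SIR, and SAR figures cited from Vincent et al. [28] are the standard BSS Eval measures: the estimated source is decomposed into a target component plus interference, noise, and artifact error terms, and each ratio isolates one class of error:

```latex
\hat{s} = s_{\mathrm{target}} + e_{\mathrm{interf}} + e_{\mathrm{noise}} + e_{\mathrm{artif}}

\mathrm{SDR} = 10 \log_{10}
  \frac{\lVert s_{\mathrm{target}} \rVert^2}
       {\lVert e_{\mathrm{interf}} + e_{\mathrm{noise}} + e_{\mathrm{artif}} \rVert^2}
\qquad
\mathrm{SIR} = 10 \log_{10}
  \frac{\lVert s_{\mathrm{target}} \rVert^2}
       {\lVert e_{\mathrm{interf}} \rVert^2}
\qquad
\mathrm{SAR} = 10 \log_{10}
  \frac{\lVert s_{\mathrm{target}} + e_{\mathrm{interf}} + e_{\mathrm{noise}} \rVert^2}
       {\lVert e_{\mathrm{artif}} \rVert^2}
```

Higher is better for all three: SDR summarizes overall quality, SIR measures suppression of the other sources, and SAR measures artifacts introduced by the separation itself.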
dc.description.provenance (en): Made available in DSpace on 2021-06-07T17:40:51Z (GMT). No. of bitstreams: 1. U0001-2007202015240300.pdf: 2028614 bytes, checksum: 9b4d7f379184fa6f18c32e0177a03ece (MD5). Previous issue date: 2020.
dc.description.tableofcontents:
Acknowledgements
Abstract (Chinese)
Abstract (English)
1 Introduction
  1.1 Topic overview
  1.2 Method overview
  1.3 Chapter organization
2 Literature review
  2.1 Traditional methods
    2.1.1 Fourier transform
    2.1.2 Short-time Fourier transform
    2.1.3 Principal component analysis
    2.1.4 Non-negative matrix factorization
  2.2 Deep learning methods
    2.2.1 Supervised learning
    2.2.2 Recurrent neural networks
    2.2.3 Long short-term memory
    2.2.4 Convolutional neural networks
3 Datasets
  3.1 Training data
    3.1.1 Ke dataset
    3.1.2 myMusic dataset
  3.2 Test data
    3.2.1 iKala dataset
    3.2.2 DSD100 dataset
    3.2.3 MedleyDB dataset
    3.2.4 MUSDB18 dataset
4 Methodology
  4.1 Problem definition
  4.2 Models
    4.2.1 Ronneberger's U-Net
    4.2.2 Jansson's U-Net
    4.2.3 Proposed method
  4.3 Experimental methods
    4.3.1 Training procedure
    4.3.2 Two-model structure
    4.3.3 Post-processing
  4.4 Experiment design
  4.5 Evaluation metrics
5 Experimental results and error analysis
  5.1 Experiment 1: Ronneberger's U-Net vs. Jansson's U-Net
  5.2 Experiment 2: baseline vs. proposed method
  5.3 Experiment 3: single-model vs. two-model structure
  5.4 Experiment 4: training on random segments vs. full segments
  5.5 Experiment 5: singing voice separation with vs. without spectral subtraction
  5.6 Experiment 6: singing voice separation with vs. without spectral normalization
  5.7 Experiment 7: proposed method with post-processing vs. existing singing voice separation tools
6 Conclusions and future work
  6.1 Conclusions
  6.2 Future work
Bibliography
dc.language.iso: zh-TW
dc.title (zh_TW): 改良U-Net對歌曲人聲分離效果 (Improving the singing voice separation performance of U-Net)
dc.title (en): On the Improvement of Singing Voice Separation Using U-Net
dc.type: Thesis
dc.date.schoolyear: 108-2
dc.description.degree: Master (碩士)
dc.contributor.oralexamcommittee: 楊奕軒 (Yi-Hsuan Yang), 蔡銘峰 (Ming-Feng Tsai)
dc.subject.keyword (zh_TW): singing voice separation, U-Net, post-processing, spectral subtraction
dc.subject.keyword (en): singing voice separation, U-Net, post-processing, spectral subtraction
dc.relation.page: 50
dc.identifier.doi: 10.6342/NTU202001653
dc.rights.note: Not authorized for public access (未授權)
dc.date.accepted: 2020-07-21
dc.contributor.author-college: College of Electrical Engineering and Computer Science (電機資訊學院)
dc.contributor.author-dept: Graduate Institute of Networking and Multimedia (資訊網路與多媒體研究所)
Appears in collections: Graduate Institute of Networking and Multimedia (資訊網路與多媒體研究所)

Files in this item:
  File: U0001-2007202015240300.pdf (currently not authorized for public access)
  Size: 1.98 MB
  Format: Adobe PDF

