NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/15465
Full metadata record
dc.contributor.advisor: 張智星 (Jyh-Shing Roger Jang)
dc.contributor.author (en): Hsiang-Yu Huang
dc.contributor.author (zh_TW): 黃翔宇
dc.date.accessioned: 2021-06-07T17:40:51Z
dc.date.copyright: 2020-07-22
dc.date.issued: 2020
dc.date.submitted: 2020-07-20
dc.identifier.citation:
[1] P.-S. Huang, S. D. Chen, P. Smaragdis, and M. Hasegawa-Johnson, "Singing-voice separation from monaural recordings using robust principal component analysis," in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2012, pp. 57–60.
[2] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Singing-voice separation from monaural recordings using deep recurrent neural networks," in ISMIR, 2014, pp. 477–482.
[3] R. M. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and J. P. Bello, "MedleyDB: A multitrack dataset for annotation-intensive MIR research," in ISMIR, vol. 14, 2014, pp. 155–160.
[4] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[5] A. Jansson, E. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde, "Singing voice separation with deep U-Net convolutional networks," 2017.
[6] R. N. Bracewell, The Fourier Transform and Its Applications. New York: McGraw-Hill, 1986.
[7] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[8] L. Rabiner, M. Cheng, A. Rosenberg, and C. McGonegal, "A comparative performance study of several pitch detection algorithms," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 5, pp. 399–418, 1976.
[9] J. Salamon, E. Gómez, D. P. Ellis, and G. Richard, "Melody extraction from polyphonic music signals: Approaches, applications, and challenges," IEEE Signal Processing Magazine, vol. 31, no. 2, pp. 118–134, 2014.
[10] A. L. Berenzweig and D. P. Ellis, "Locating singing voice segments within music signals," in Proceedings of the 2001 IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics (Cat. No. 01TH8575). IEEE, 2001, pp. 119–122.
[11] A. Cichocki, R. Zdunek, and S.-i. Amari, "New algorithms for non-negative matrix factorization in applications to blind source separation," in 2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings, vol. 5. IEEE, 2006, pp. V–V.
[12] Z. Rafii and B. Pardo, "Repeating pattern extraction technique (REPET): A simple method for music/voice separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 1, pp. 73–84, 2012.
[13] I. T. Jolliffe, "Principal components in regression analysis," in Principal Component Analysis. Springer, 1986, pp. 129–155.
[14] S. B. Kotsiantis, I. Zaharakis, and P. Pintelas, "Supervised machine learning: A review of classification techniques," Emerging Artificial Intelligence Applications in Computer Engineering, vol. 160, pp. 3–24, 2007.
[15] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[16] J. Muth, S. Uhlich, N. Perraudin, T. Kemp, F. Cardinaux, and Y. Mitsufuji, "Improving DNN-based music source separation using phase features," arXiv preprint arXiv:1807.02710, 2018.
[17] D. Stoller, S. Ewert, and S. Dixon, "Adversarial semi-supervised audio source separation applied to singing voice extraction," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 2391–2395.
[18] S. R. Park and J. Lee, "A fully convolutional neural network for speech enhancement," arXiv preprint arXiv:1609.07132, 2016.
[19] D. Liu, P. Smaragdis, and M. Kim, "Experiments on deep learning for speech denoising," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[20] T.-S. Chan, T.-C. Yeh, Z.-C. Fan, H.-W. Chen, L. Su, Y.-H. Yang, and R. Jang, "Vocal activity informed singing voice separation with the iKala dataset," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 718–722.
[21] A. Liutkus, F.-R. Stöter, Z. Rafii, D. Kitamura, B. Rivet, N. Ito, N. Ono, and J. Fontecave, "The 2016 signal separation evaluation campaign," in Latent Variable Analysis and Signal Separation: 12th International Conference, LVA/ICA 2015, Liberec, Czech Republic, August 25–28, 2015, Proceedings, P. Tichavský, M. Babaie-Zadeh, O. J. Michel, and N. Thirion-Moreau, Eds. Cham: Springer International Publishing, 2017, pp. 323–332.
[22] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner, "The MUSDB18 corpus for music separation," Dec. 2017. [Online]. Available: https://doi.org/10.5281/zenodo.1117372
[23] S. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, no. 2, pp. 113–120, 1979.
[24] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
[25] R. Hennequin, A. Khlif, F. Voituret, and M. Moussallam, "Spleeter: A fast and state-of-the-art music source separation tool with pre-trained models," in Proc. International Society for Music Information Retrieval Conference, 2019.
[26] A. Défossez, N. Usunier, L. Bottou, and F. Bach, "Music source separation in the waveform domain," arXiv preprint arXiv:1911.13254, 2019.
[27] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
[28] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/15465
dc.description.abstract (zh_TW): Deep learning has become the mainstream approach to singing voice separation. This thesis focuses on U-Net, the most classic deep learning architecture for the task, and addresses three main questions. First, it compares the original U-Net architecture proposed by Ronneberger with the U-Net architecture proposed by Jansson in terms of vocal separation quality. Second, it combines the characteristics of these two U-Net models into a new U-Net architecture and investigates whether this improves the separation of vocals and accompaniment. Third, it examines the separated vocal output and studies whether spectral subtraction can serve as a post-processing step to further improve separation performance. The datasets used include iKala, DSD100, MedleyDB, and MUSDB18; in addition, 900 multi-track songs obtained through a collaboration with an external music studio were used as training data. The separation results of each model are evaluated with the Source-to-Distortion Ratio (SDR), Source-to-Interferences Ratio (SIR), and Sources-to-Artifacts Ratio (SAR) proposed by Vincent. Finally, the proposed architecture and post-processing are compared with the latest publicly available music source separation tools, Spleeter and Demucs; the results show that, overall, our approach separates vocals slightly better on these metrics.
dc.description.abstract (en): Nowadays, deep learning has become the mainstream method for singing voice separation. This study investigates U-Net, the most classic deep learning architecture for singing voice separation. The thesis is divided into three parts. The first compares the U-Net architectures proposed by Ronneberger and by Jansson, respectively. The second proposes a new U-Net model that combines characteristics of the two aforementioned models, to see whether the new model can improve the separation results. The third explores whether spectral subtraction can be used as a post-processing step to improve the performance of singing voice separation. The datasets used in this research include iKala, DSD100, MedleyDB, and MUSDB18; in addition, we acquired 900 multi-track songs for model training. For performance evaluation, we used the Source-to-Distortion Ratio (SDR), Source-to-Interferences Ratio (SIR), and Sources-to-Artifacts Ratio (SAR) proposed by Vincent. Finally, we compared our singing voice separation results with the latest publicly available music source separation tools, Spleeter and Demucs, and found that our model compares favorably with both in terms of the above indicators.
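As a rough illustration of the spectral-subtraction post-processing described in the abstract, here is a minimal sketch in Python. It assumes the accompaniment spectrum is estimated from the residual (mixture minus the estimated vocals); the STFT settings, the over-subtraction factor `alpha`, and the spectral floor are illustrative assumptions, not parameters taken from the thesis.

```python
# Hypothetical sketch of spectral subtraction as post-processing for a
# separated vocal track. Parameter values are illustrative assumptions.
import numpy as np
import librosa

def spectral_subtract_vocals(vocal_est, mixture, n_fft=2048, hop=512,
                             alpha=1.0, floor=0.01):
    """Subtract an accompaniment magnitude estimate from the vocal estimate."""
    V = librosa.stft(vocal_est, n_fft=n_fft, hop_length=hop)
    # Assumption: treat the residual (mixture minus estimated vocals) as the
    # "noise" whose magnitude spectrum is subtracted, in the spirit of
    # Boll-style spectral subtraction [23].
    residual = mixture[:len(vocal_est)] - vocal_est
    R = librosa.stft(residual, n_fft=n_fft, hop_length=hop)
    # Magnitude subtraction with a spectral floor to avoid negative values
    # and limit musical noise.
    mag = np.maximum(np.abs(V) - alpha * np.abs(R), floor * np.abs(V))
    # Reuse the phase of the vocal estimate for reconstruction.
    cleaned = mag * np.exp(1j * np.angle(V))
    return librosa.istft(cleaned, hop_length=hop, length=len(vocal_est))
```

Flooring the subtracted magnitude at a small fraction of the original spectrum is the standard guard against negative magnitudes in Boll-style spectral subtraction [23].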
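For context, the SDR, SIR, and SAR figures cited from Vincent et al. [28] are the standard BSS Eval measures: the estimated source is decomposed into a target component plus interference, noise, and artifact error terms, and each ratio isolates one class of error:

```latex
\hat{s} = s_{\mathrm{target}} + e_{\mathrm{interf}} + e_{\mathrm{noise}} + e_{\mathrm{artif}}

\mathrm{SDR} = 10 \log_{10}
  \frac{\lVert s_{\mathrm{target}} \rVert^2}
       {\lVert e_{\mathrm{interf}} + e_{\mathrm{noise}} + e_{\mathrm{artif}} \rVert^2}
\qquad
\mathrm{SIR} = 10 \log_{10}
  \frac{\lVert s_{\mathrm{target}} \rVert^2}
       {\lVert e_{\mathrm{interf}} \rVert^2}
\qquad
\mathrm{SAR} = 10 \log_{10}
  \frac{\lVert s_{\mathrm{target}} + e_{\mathrm{interf}} + e_{\mathrm{noise}} \rVert^2}
       {\lVert e_{\mathrm{artif}} \rVert^2}
```

Higher is better for all three: SDR summarizes overall quality, SIR measures suppression of the other sources, and SAR measures artifacts introduced by the separation itself.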
dc.description.provenance (en): Made available in DSpace on 2021-06-07T17:40:51Z (GMT). No. of bitstreams: 1. U0001-2007202015240300.pdf: 2028614 bytes, checksum: 9b4d7f379184fa6f18c32e0177a03ece (MD5). Previous issue date: 2020.
dc.description.tableofcontents:
Acknowledgements
Abstract (Chinese)
Abstract (English)
1 Introduction
  1.1 Topic overview
  1.2 Method overview
  1.3 Chapter organization
2 Literature review
  2.1 Traditional methods
    2.1.1 Fourier transform
    2.1.2 Short-time Fourier transform
    2.1.3 Principal component analysis
    2.1.4 Non-negative matrix factorization
  2.2 Deep learning methods
    2.2.1 Supervised learning
    2.2.2 Recurrent neural networks
    2.2.3 Long short-term memory
    2.2.4 Convolutional neural networks
3 Datasets
  3.1 Training data
    3.1.1 Ke dataset
    3.1.2 myMusic dataset
  3.2 Test data
    3.2.1 iKala dataset
    3.2.2 DSD100 dataset
    3.2.3 MedleyDB dataset
    3.2.4 MUSDB18 dataset
4 Methodology
  4.1 Problem definition
  4.2 Models
    4.2.1 Ronneberger's U-Net
    4.2.2 Jansson's U-Net
    4.2.3 Proposed method
  4.3 Experimental methods
    4.3.1 Training procedure
    4.3.2 Two-model structure
    4.3.3 Post-processing
  4.4 Experiment design
  4.5 Evaluation metrics
5 Experimental results and error analysis
  5.1 Experiment 1: Ronneberger's U-Net vs. Jansson's U-Net
  5.2 Experiment 2: baseline vs. proposed method
  5.3 Experiment 3: single-model vs. two-model structure
  5.4 Experiment 4: training on random segments vs. full segments
  5.5 Experiment 5: singing voice separation with vs. without spectral subtraction
  5.6 Experiment 6: singing voice separation with vs. without spectral normalization
  5.7 Experiment 7: proposed method with post-processing vs. existing singing voice separation tools
6 Conclusions and future work
  6.1 Conclusions
  6.2 Future work
Bibliography
dc.language.iso: zh-TW
dc.title (zh_TW): 改良U-Net對歌曲人聲分離效果 (Improving the singing voice separation performance of U-Net)
dc.title (en): On the Improvement of Singing Voice Separation Using U-Net
dc.type: Thesis
dc.date.schoolyear: 108-2
dc.description.degree: Master (碩士)
dc.contributor.oralexamcommittee: 楊奕軒 (Yi-Hsuan Yang), 蔡銘峰 (Ming-Feng Tsai)
dc.subject.keyword (zh_TW): singing voice separation, U-Net, post-processing, spectral subtraction
dc.subject.keyword (en): singing voice separation, U-Net, post-processing, spectral subtraction
dc.relation.page: 50
dc.identifier.doi: 10.6342/NTU202001653
dc.rights.note: Not authorized for public access (未授權)
dc.date.accepted: 2020-07-21
dc.contributor.author-college: College of Electrical Engineering and Computer Science (電機資訊學院)
dc.contributor.author-dept: Graduate Institute of Networking and Multimedia (資訊網路與多媒體研究所)
Appears in collections: Graduate Institute of Networking and Multimedia (資訊網路與多媒體研究所)

Files in this item:
  File: U0001-2007202015240300.pdf (currently not authorized for public access)
  Size: 1.98 MB
  Format: Adobe PDF

