DSpace

The DSpace institutional repository is dedicated to preserving digital materials of all kinds (e.g., text, images, PDF) and making them easy to access.

Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/79698
Full metadata record (DC field: value [language])

dc.contributor.advisor: 張智星 (Jyh-Shing Jang)
dc.contributor.author: Yu-Li Wang [en]
dc.contributor.author: 王俞禮 [zh_TW]
dc.date.accessioned: 2022-11-23T09:07:57Z
dc.date.available: 2021-09-02
dc.date.available: 2022-11-23T09:07:57Z
dc.date.copyright: 2021-09-02
dc.date.issued: 2021
dc.date.submitted: 2021-08-24
dc.identifier.citation:
F. Bao and W. H. Abdulla. Noise masking method based on an effective ratio mask estimation in gammatone channels. APSIPA Transactions on Signal and Information Processing, 7:3–4, 2018.
R. M. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and J. P. Bello. MedleyDB: A multitrack dataset for annotation-intensive MIR research. In ISMIR, volume 14, pages 155–160, 2014.
S. Boll. Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing, 27(2):113–120, 1979.
S. Braun and I. Tashev. A consolidated view of loss functions for supervised deep learning-based speech enhancement. arXiv preprint arXiv:2009.12286, 2020.
T.-S. Chan, T.-C. Yeh, Z.-C. Fan, H.-W. Chen, L. Su, Y.-H. Yang, and R. Jang. Vocal activity informed singing voice separation with the iKala dataset. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 718–722. IEEE, 2015.
H.-S. Choi, J.-H. Kim, J. Huh, A. Kim, J.-W. Ha, and K. Lee. Phase-aware speech enhancement with deep complex U-Net. In International Conference on Learning Representations, 2018.
F. Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1251–1258, 2017.
A. Défossez, N. Usunier, L. Bottou, and F. Bach. Music source separation in the waveform domain. arXiv preprint arXiv:1911.13254, 2019.
M. Dukhan, Y. Wu, and H. Lu. QNNPACK: Open source library for optimized mobile deep learning, 2018.
F. A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: Continual prediction with LSTM. 1999.
R. Giri, U. Isik, and A. Krishnaswamy. Attention Wave-U-Net for speech enhancement. In 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 249–253. IEEE, 2019.
E. Gusó. On Loss Functions for Music Source Separation. PhD thesis, Zenodo, Aug. 2020.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
R. Hennequin, A. Khlif, F. Voituret, and M. Moussallam. Spleeter: A fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software, 5(50):2154, 2020.
A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
A. Jansson, E. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde. Singing voice separation with deep U-Net convolutional networks. 2017.
S. Jetley, N. A. Lord, N. Lee, and P. H. Torr. Learn to pay attention. arXiv preprint arXiv:1804.02391, 2018.
M. Khened, V. A. Kollerathu, and G. Krishnamurthi. Fully convolutional multi-scale residual DenseNets for cardiac segmentation and automated cardiac diagnosis using ensemble of classifiers. Medical Image Analysis, 51:21–45, 2019.
D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
S. B. Kotsiantis, I. Zaharakis, and P. Pintelas. Supervised machine learning: A review of classification techniques. Emerging Artificial Intelligence Applications in Computer Engineering, 160(1):3–24, 2007.
H. Liu, L. Xie, J. Wu, and G. Yang. Channel-wise subband input for better voice and accompaniment separation on high resolution music. arXiv preprint arXiv:2008.05216, 2020.
Y. Liu, B. Thoshkahna, A. Milani, and T. Kristjansson. Voice and accompaniment separation in music using self-attention convolutional neural network. arXiv preprint arXiv:2003.08954, 2020.
A. Liutkus and F.-R. Stöter. sigsep/norbert: First official Norbert release, 2019.
A. Liutkus, F.-R. Stöter, Z. Rafii, D. Kitamura, B. Rivet, N. Ito, N. Ono, and J. Fontecave. The 2016 signal separation evaluation campaign. In P. Tichavský, M. Babaie-Zadeh, O. J. Michel, and N. Thirion-Moreau, editors, Latent Variable Analysis and Signal Separation: 12th International Conference, LVA/ICA 2015, Liberec, Czech Republic, August 25-28, 2015, Proceedings, pages 323–332, Cham, 2017. Springer International Publishing.
A. Liutkus, F.-R. Stöter, Z. Rafii, D. Kitamura, B. Rivet, N. Ito, N. Ono, and J. Fontecave. The 2016 signal separation evaluation campaign. In International Conference on Latent Variable Analysis and Signal Separation, pages 323–332. Springer, 2017.
E. Manilow, G. Wichern, P. Seetharaman, and J. Le Roux. Cutting music source separation some Slakh: A dataset to study the impact of training data quality and quantity. In 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 45–49. IEEE, 2019.
H. Nakajima, Y. Takahashi, K. Kondo, and Y. Hisaminato. Monaural source enhancement maximizing source-to-distortion ratio via automatic differentiation. arXiv preprint arXiv:1806.05791, 2018.
O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz, et al. Attention U-Net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999, 2018.
L. Prétet, R. Hennequin, J. Royo-Letelier, and A. Vaglio. Singing voice separation: A study on training data. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 506–510. IEEE, 2019.
Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner. MUSDB18: A corpus for music separation, 2017.
Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner. The MUSDB18 corpus for music separation, Dec. 2017.
Z. Rafii and B. Pardo. Repeating pattern extraction technique (REPET): A simple method for music/voice separation. IEEE Transactions on Audio, Speech, and Language Processing, 21(1):73–84, 2012.
B. P. Rohman, K. Paramayudha, and A. Y. Hercuadi. A novel scheme of speech enhancement using power spectral subtraction multi-layer perceptron network. Telkomnika, 14(1):181, 2016.
O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
H. R. Roth, L. Lu, N. Lay, A. P. Harrison, A. Farag, A. Sohn, and R. M. Summers. Spatial aggregation of holistically-nested convolutional neural networks for automated pancreas localization and segmentation. Medical Image Analysis, 45:94–107, 2018.
D. Samuel, A. Ganeshan, and J. Naradowsky. Meta-learning extractors for music source separation. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 816–820. IEEE, 2020.
M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
P. Shaw, J. Uszkoreit, and A. Vaswani. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155, 2018.
D. Stoller, S. Ewert, and S. Dixon. Adversarial semi-supervised audio source separation applied to singing voice extraction. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2391–2395. IEEE, 2018.
D. Stoller, S. Ewert, and S. Dixon. Wave-U-Net: A multi-scale neural network for end-to-end audio source separation. arXiv preprint arXiv:1806.03185, 2018.
F.-R. Stöter and A. Liutkus. museval, 2018.
F.-R. Stöter, A. Liutkus, and N. Ito. The 2018 signal separation evaluation campaign. In International Conference on Latent Variable Analysis and Signal Separation, pages 293–305. Springer, 2018.
N. Takahashi, N. Goswami, and Y. Mitsufuji. MMDenseLSTM: An efficient combination of convolutional and recurrent neural networks for audio source separation. In 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pages 106–110. IEEE, 2018.
E. Tzinis, S. Wisdom, J. R. Hershey, A. Jansen, and D. P. Ellis. Improving universal sound separation using sound classification. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 96–100. IEEE, 2020.
S. Uhlich, M. Porcu, F. Giron, M. Enenkl, T. Kemp, N. Takahashi, and Y. Mitsufuji. Improving music source separation based on deep neural networks through data augmentation and network blending. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 261–265. IEEE, 2017.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
E. Vincent, R. Gribonval, and C. Févotte. Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing, 14(4):1462–1469, 2006.
N. Wiener. Extrapolation, Interpolation, and Smoothing of Stationary Time Series, with Engineering Applications. 1949.
C.-J. Wu, D. Brooks, K. Chen, D. Chen, S. Choudhury, M. Dukhan, K. Hazelwood, E. Isaac, Y. Jia, B. Jia, et al. Machine learning at Facebook: Understanding inference at the edge. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 331–344. IEEE, 2019.
H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena. Self-attention generative adversarial networks. In International Conference on Machine Learning, pages 7354–7363. PMLR, 2019.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/79698
dc.description.abstract: Singing voice separation aims to split music into a vocal track and an accompaniment track. It can be performed in the time domain or the frequency domain; the latter is the focus of this work. Deep learning has become indispensable to modern source separation. This thesis builds on the U-Net architecture of Ronneberger et al., which performs well on biomedical image segmentation, and trains it to segment spectrograms. Drawing on ratio mask and Wiener filter theory, we improve the existing U-Net model so that spike anomalies in its output can be corrected (accompaniment SDR rises from 13.805 to 14.288). We further extend U-Net with attention gates and self-attention, letting the model learn sounds with regular rhythmic structure (accompaniment SDR rises from 13.805 to 14.457). Following earlier work on spectral subtraction, we tune the subtraction amount in each frequency band to improve the model output, but the proposed tuning does not outperform the subtraction amounts from prior work (accompaniment SDR: baseline 13.805, prior work 14.031, this work 13.895). We also apply model pruning to U-Net while preserving as much performance as possible (model size drops from 118.9 MB to 59.8 MB; accompaniment SDR falls from 12.989 to 12.771), and tune model quantization parameters to limit the performance loss (model size drops from 118.9 MB to 4.75 MB; accompaniment SDR falls from 12.989 to 11.184). The experiments use the public datasets MUSDB18, DSD100, MedleyDB, and iKala, plus one private dataset, Ke (捷奏錄音室-柯老師). [zh_TW]
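To make the filtering ideas in the abstract concrete, here is a minimal NumPy sketch of a Wiener-style ratio mask and of per-band spectral subtraction on magnitude spectrograms. It is an illustration only, not the thesis implementation: the function names, the per-band factor alpha, and the toy shapes are all hypothetical.

```python
import numpy as np

def wiener_mask(vocal_mag, accomp_mag, eps=1e-8):
    # Wiener-style ratio mask built from estimated source magnitudes.
    # Values lie in [0, 1]; bins dominated by the accompaniment are
    # attenuated, which also damps isolated spikes in the raw estimate.
    power_v = vocal_mag ** 2
    return power_v / (power_v + accomp_mag ** 2 + eps)

def spectral_subtract(mix_mag, interference_mag, alpha, floor=0.0):
    # Per-band spectral subtraction: subtract the estimated interfering
    # magnitude scaled by a per-frequency factor alpha, then clip at floor.
    # mix_mag, interference_mag: (freq, time); alpha: (freq,).
    return np.maximum(mix_mag - alpha[:, None] * interference_mag, floor)

# Toy usage on random "spectrograms" (513 frequency bins, 100 frames).
rng = np.random.default_rng(0)
mix = rng.random((513, 100))
vocal_est = rng.random((513, 100))
accomp_est = rng.random((513, 100))

vocal_out = wiener_mask(vocal_est, accomp_est) * mix   # masked vocal magnitude
alpha = np.linspace(0.8, 1.2, 513)                     # hypothetical per-band factors
accomp_out = spectral_subtract(mix, vocal_est, alpha)  # accompaniment estimate
```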
dc.description.provenance: Made available in DSpace on 2022-11-23T09:07:57Z (GMT). No. of bitstreams: 1. U0001-2408202115100900.pdf: 3846493 bytes, checksum: 7ab42a1c38595946e1aa084a087a6f63 (MD5). Previous issue date: 2021. [en]
dc.description.tableofcontents口試委員審定書 — i 致謝 — iii 摘要 — v Abstract — vii 目錄 — ix 圖目錄 — xiii 表目錄 — xvii 第一章 緒論 1 1.1 動機. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 研究方向與主要貢獻. . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.3 章節概要. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 第二章 文獻探討 3 2.1 傳統方法. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.1.1 重複結構擷取. . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.2 深度學習法. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.2.1 濾波處理. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.2.1.1 Ratio Mask Filter. . . . . . . . . . . . . . . . . . . . 5 2.2.1.2 Wiener Filter. . . . . . . . . . . . . . . . . . . . . . . 5 2.2.1.3 頻譜刪減法. . . . . . . . . . . . . . . . . . . . . . . 6 2.2.2 深度神經模型U­Net. . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2.2.1 Spleeter. . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2.2.2 Demucs. . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2.3 注意力模型. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.3.1 Self-­attention. . . . . . . . . . . . . . . . . . . . . . 9 2.2.3.2 Attention Gate. . . . . . . . . . . . . . . . . . . . . . 10 2.3 模型壓縮方法. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3.1 模型剪枝. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3.1.1 深度可分卷積. . . . . . . . . . . . . . . . . . . . . . 12 2.3.1.2 Inverted Residuals與Linear Bottlenecks. . . . . . . . 13 2.3.2 模型量化. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.3.2.1 Quantized Neural Networks package. . . . . . . . . 15 第三章 資料集簡介 17 3.1 MusDB18. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.1.1 DSD100. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.1.2 MedleyDB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.1.3 Museval模型測試指標. . . . . . . . . . . . . . . . . . . . . . . . 19 3.2 iKala. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.3 捷奏錄音室­柯老師. . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.4 其餘資料收集. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 第四章 研究方法 23 4.1 問題定義. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 4.2 實驗環境. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 4.3 評量指標. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 4.4 實驗設計與方法. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 4.4.1 神經模型訓練設定. . . . . . . . . . . . . . . . . . . . . . . . . . 27 4.4.2 濾波實驗. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 4.4.2.1 頻譜刪減法. . . . . . . . . . . . . . . . . . . . . . . 29 4.4.3 注意力模型實驗. . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.4.3.1 Self­attention架構實驗. . . . . . . . . . . . . . . . . 33 4.4.3.2 Attention Gate架構實驗. . . . . . . . . . . . . . . . 34 4.4.4 模型剪枝實驗. . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.4.5 模型量化實驗. . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 第五章 實驗結果討論與錯誤分析 39 5.1 實驗一:比較Ratio Mask與Wiener Filter的效果比較. . . . . . . . 39 5.2 實驗二:頻譜刪減法效果比較. . . . . . . . . . . . . . . . . . . . . 41 5.3 實驗三:不同注意力模型效果比較. . . . . . . . . . . . . . . . . . 45 5.4 實驗四:模型剪枝效果比較. . . . . . . . . . . . . . . . . . . . . . 48 5.5 實驗五:模型量化效果比較. . . . . . . . . . . . . . . . . . . . . . 50 第六章 結論與未來展望 53 6.1 結論. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 6.2 未來展望. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . 54 參考文獻 57 附錄A — 提出模型的完整訓練 65 A.1 U­-Net6 (Sattn) 的 L1 loss 下降趨勢. . . . . . . . . . . . . . . . . . . 65 A.2 U­-Net6 (DSConB) 的 L1 loss 下降趨勢. . . . . . . . . . . . . . . . . 65 A.3 U­-Net6 (IRB) 的 L1 loss 下降趨勢. . . . . . . . . . . . . . . . . . . . 66 A.4 以 Museval 指標與目前技術比較. . . . . . . . . . . . . . . . . . . . 66 A.5 有無使用Musdb18資料集訓練之差異. . . . . . . . . . . 68
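As a rough companion to the pruning and quantization chapters listed above (sections 2.3, 4.4.4, and 4.4.5), here is a hedged PyTorch sketch of the two compression steps on a toy network. The architecture is a stand-in, not the thesis' U-Net, and the 50% pruning amount is illustrative. PyTorch's dynamic quantization covers Linear/LSTM layers; conv layers would instead need static quantization with calibration, served on mobile by backends such as the QNNPACK library cited in the references.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy stand-in for one encoder block plus a head; not the thesis' U-Net.
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=5, stride=2, padding=2),
    nn.LeakyReLU(0.2),
    nn.Flatten(),
    nn.Linear(16 * 8 * 8, 64),
)

# 1) Unstructured L1-magnitude pruning: zero the 50% smallest conv weights,
#    then make the mask permanent so the zeros persist in the state dict.
conv = model[0]
prune.l1_unstructured(conv, name="weight", amount=0.5)
prune.remove(conv, "weight")

# 2) Post-training dynamic quantization: Linear weights stored as int8.
qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 1, 16, 16)  # toy input sized to match Flatten -> Linear
print(qmodel(x).shape)         # torch.Size([1, 64])
```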
dc.language.iso: zh-TW
dc.title: 使用 U-Net 及其壓縮版本來進行歌聲分離 [zh_TW]
dc.title: Singing Voice Separation Using U-Net and Its Compressed Version [en]
dc.date.schoolyear: 109-2
dc.description.degree: Master's (碩士)
dc.contributor.oralexamcommittee: 李宏毅 (Hung-yi Lee), 楊奕軒 (Yi-Hsuan Yang)
dc.subject.keyword: 歌聲分離, U-Net, 注意力模型, 頻譜刪減, 深度模型壓縮 [zh_TW]
dc.subject.keyword: singing voice separation, U-Net, attention based model, spectrum subtraction, network compression [en]
dc.relation.page: 68
dc.identifier.doi: 10.6342/NTU202102677
dc.rights.note: Authorization granted (worldwide open access)
dc.date.accepted: 2021-08-25
dc.contributor.author-college: College of Electrical Engineering and Computer Science (電機資訊學院) [zh_TW]
dc.contributor.author-dept: Graduate Institute of Computer Science and Information Engineering (資訊工程學研究所) [zh_TW]
Appears in collections: Department of Computer Science and Information Engineering (資訊工程學系)

Files in this item:
File | Size | Format
U0001-2408202115100900.pdf | 3.76 MB | Adobe PDF


All items in the system are protected by copyright, with all rights reserved, unless otherwise indicated in their copyright terms.
