Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/89896
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 張智星 | zh_TW |
dc.contributor.advisor | Jyh-Shing Roger Jang | en |
dc.contributor.author | Quinn Myles McGarry | zh_TW |
dc.contributor.author | Quinn Myles McGarry | en |
dc.date.accessioned | 2023-09-22T16:35:11Z | - |
dc.date.available | 2023-11-10 | - |
dc.date.copyright | 2023-09-22 | - |
dc.date.issued | 2023 | - |
dc.date.submitted | 2023-08-09 | - |
dc.identifier.citation | [1] Y. Luo and N. Mesgarani, “TasNet: Time-domain audio separation network for real-time, single-channel speech separation,” in Proc. ICASSP, IEEE, pp. 696–700, 2018.
[2] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 27, no. 8, pp. 1256–1266, 2019.
[3] Y. Luo, Z. Chen, and T. Yoshioka, “Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation,” in Proc. ICASSP, IEEE, pp. 46–50, 2020.
[4] C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi, and J. Zhong, “Attention is all you need in speech separation,” in Proc. ICASSP, IEEE, pp. 21–25, 2021.
[5] S. Lutati, E. Nachmani, and L. Wolf, “Separate and Diffuse: Using a pretrained diffusion model for improving source separation,” arXiv:2301.10752, 2023.
[6] R. Hennequin, A. Khlif, F. Voituret, and M. Moussallam, “Spleeter: A fast and efficient music source separation tool with pre-trained models,” Journal of Open Source Software, vol. 5, no. 50, p. 2154, 2020.
[7] Y. Luo and J. Yu, “Music source separation with Band-split RNN,” arXiv:2209.15174, 2022.
[8] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” arXiv:1505.04597, 2015.
[9] S. Rosenzweig, H. Cuesta, C. Weiß, F. Scherbaum, E. Gómez, and M. Müller, “Dagstuhl ChoirSet: A multitrack dataset for MIR research on choral singing,” Transactions of the International Society for Music Information Retrieval, vol. 3, no. 1, pp. 98–110, 2020.
[10] H. Cuesta, E. Gómez, A. Martorell, and F. Loáiciga, “Choral Singing Dataset,” in Proc. 15th International Conference on Music Perception and Cognition (ICMPC), 2018.
[11] R. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and J. P. Bello, “MedleyDB: A dataset of multitrack audio for music research,” in Proc. 15th International Society for Music Information Retrieval Conference (ISMIR), Oct. 2014.
[12] R. Bittner, J. Wilkins, H. Yip, and J. Bello, “MedleyDB 2.0: New data and a system for sustainable data collection,” in Proc. International Society for Music Information Retrieval Conference (ISMIR), New York, NY, USA, 2016.
[13] D. Griffin and J. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, Apr. 1984.
[14] S. Rouard, F. Massa, and A. Défossez, “Hybrid Transformers for music source separation,” arXiv:2211.08553, 2022.
[15] D. Stoller, S. Ewert, and S. Dixon, “Wave-U-Net: A multi-scale neural network for end-to-end audio source separation,” arXiv:1806.03185, 2018. | - |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/89896 | - |
dc.description.abstract | None | zh_TW |
dc.description.abstract | In recent years, many studies have addressed the problem of speech separation: separating audio of multiple people speaking simultaneously into the audio of each individual speaker. However, audio source separation of multiple simultaneous singers is still not well explored and remains a challenge. This is mainly because singing voices tend to “blend” together much more than speaking voices, and multiple vocal lines often sing the same words, and potentially the same frequencies, in unison. To address these issues, we propose a new U-Net based model specifically for a cappella singing separation of two singers and compare it to three state-of-the-art speech separation models.
The results of our experiments vary widely. The U-Net based network excels at separating music taken from choir datasets, with a maximum mean SDR of 9.76 dB, but performs poorly on random combinations of singers. The best speech separation network separates random combinations of singers quite well, with a maximum mean SDR of 7.64 dB after finetuning, but is incapable of separating samples in which the singers sing the same lyrics simultaneously. This singing separation score is also much lower than the same model’s mean SDR of 9.04 dB for speech separation. These results are nuanced and show that singing separation is a different and, overall, more difficult task than speech separation. However, they also show that both a U-Net based network and one based on contemporary speech separation networks may certainly be capable of performing well on it. | en |
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-09-22T16:35:11Z No. of bitstreams: 0 | en |
dc.description.provenance | Made available in DSpace on 2023-09-22T16:35:11Z (GMT). No. of bitstreams: 0 | en |
dc.description.tableofcontents | Acknowledgements i
Abstract iii
Table of Contents v
List of Figures vii
List of Tables ix
Denotation x
Chapter 1 Introduction 1
1.1 Audio Separation 1
1.2 Possible Cases 3
1.3 Applications 5
Chapter 2 Literature Review 7
2.1 Speech Separation 7
2.1.1 TasNet 7
2.1.2 Conv-TasNet 10
2.1.3 Dual-Path RNN 12
2.1.4 SepFormer 13
2.1.5 Separate and Diffuse 14
2.2 Music Source Separation 15
2.2.1 Deezer Spleeter 15
2.2.2 BandSplit RNN 16
2.3 U-Net 17
Chapter 3 Methodology 20
3.1 Datasets 20
3.2 Data Preprocessing 24
3.3 Data Postprocessing 27
3.4 New Network 28
3.5 Experiments 34
3.6 Evaluation Metrics 35
3.7 Training Specifications 36
Chapter 4 Results 37
4.1 Custom Models 37
4.2 Speech Separation Models 40
4.3 Finetuning SepFormer 43
4.4 Subjective Listening Test 46
Chapter 5 Discussion 48
5.1 Custom U-Net Models 48
5.2 Speech Separation Models 51
Chapter 6 Conclusions 56
6.1 Final Conclusions 56
6.2 Future Work 57
References 58 | - |
dc.language.iso | en | - |
dc.title | 使用機器學習進行無伴奏合唱的歌聲分離 | zh_TW |
dc.title | Machine Learning for Source Separation of A Cappella Music | en |
dc.type | Thesis | - |
dc.date.schoolyear | 112-1 | - |
dc.description.degree | Master | - |
dc.contributor.oralexamcommittee | 蘇黎;曹昱 | zh_TW |
dc.contributor.oralexamcommittee | Li Su;Yu Tsao | en |
dc.subject.keyword | 語音分離,歌唱分離,音樂源分離 | zh_TW |
dc.subject.keyword | Machine learning,Audio source separation,Music source separation,Speech separation,Music information retrieval,Singing separation,A cappella separation,U-Net,TasNet | en |
dc.relation.page | 60 | - |
dc.identifier.doi | 10.6342/NTU202303444 | - |
dc.rights.note | Authorized for release (worldwide open access) | - |
dc.date.accepted | 2023-08-10 | - |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
dc.contributor.author-dept | Department of Computer Science and Information Engineering | - |
Appears in Collections: | Department of Computer Science and Information Engineering |
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-112-1.pdf | 1.93 MB | Adobe PDF | View/Open |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
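The abstract above reports separation quality as mean SDR (signal-to-distortion ratio) in dB. As a rough illustration of what that metric measures, here is a minimal sketch of a plain SDR computation; this simplified definition is an assumption for illustration only — the thesis presumably uses the standard BSS Eval formulation, which additionally projects the estimate onto the reference before measuring distortion.

```python
import numpy as np

def sdr_db(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Plain signal-to-distortion ratio in dB: energy of the reference
    signal divided by the energy of the estimation error. Simplified
    for illustration; the BSS Eval variant used in the separation
    literature decomposes the error into further components."""
    distortion = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / (np.sum(distortion ** 2) + 1e-12))

# A perfect estimate gives a very high SDR; a 10% amplitude error gives ~20 dB.
ref = np.sin(np.linspace(0.0, 2.0 * np.pi, 1000))
print(round(sdr_db(ref, 0.9 * ref), 2))  # → 20.0
```

Under this definition, higher is better: the 9.76 dB choir-dataset figure quoted in the abstract means the separated signal's energy is roughly ten times that of its residual error.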