Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/84701
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 蘇黎(Li Su), lisu@iis.sinica.edu.tw | |
dc.contributor.author | Jui-Te Wu | en |
dc.contributor.author | 吳睿得 | zh_TW |
dc.date.accessioned | 2023-03-19T22:21:14Z | - |
dc.date.copyright | 2022-09-14 | |
dc.date.issued | 2022 | |
dc.date.submitted | 2022-09-08 | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/84701 | - |
dc.description.abstract | 深度學習的最新進展不僅促進了零樣本歌聲合成和歌聲轉換任務的實現,同時也提供了將這兩個任務統一為一個通用模型的機會。在本文中我們提出了一個統一兩項任務的模型,可以從文本或音頻格式的任意源歌唱內容生成任意目標歌手的歌聲。該模型結合了處理文本輸入的詞源編碼器以及處理音頻輸入的聲源編碼器進行訓練,並透過以動態規劃為基礎的自督導式學習,編碼器將會在訓練過程中學習如何將音頻與音素進行最佳的對齊。這些編碼器也將音頻和文本數據分別映射到一個相似的潛在空間中,使得歌聲轉換與合成兩項任務可以透過同一個解碼器來完成。目標歌手的參考音檔被轉換成以幀為單位的碎片化資訊,並透過注意機制來根據源內容進行提取與重構,這使模型能夠在測試階段從文本或音頻源生成沒學習過的目標歌手的聲音。客觀和主觀實驗都證實,所提出的模型表現超越過去最佳的任意歌聲轉換與任意歌聲合成模型。 | zh_TW |
dc.description.abstract | Recent advances in deep learning not only facilitate the implementation of zero-shot singing voice synthesis (SVS) and singing voice conversion (SVC) tasks, but also provide the opportunity to unify these two tasks into one generalized model. In this paper, we propose such a model that can generate the singing voice of any target singer from any source singing content in either text or audio format. The model incorporates self-supervised joint training of the phonetic source encoder and the acoustic source encoder, with an audio-to-phoneme alignment process in each training step, such that these encoders map the audio and text data respectively into a shared, temporally aligned, and singer-agnostic latent space. The target singer's latent representations encoded at different granularity levels are all trained to match the source latent representations sequentially with the attention mechanisms in the decoding stage. This enables the model to generate an unseen target singer's voice with fine-grained resolution from either text or audio sources during the inference stage. Both objective and subjective experiments confirmed that the proposed model is competitive with the state-of-the-art SVC and SVS methods. (A minimal illustrative sketch of this dual-encoder design appears after the metadata table below.) | en |
dc.description.provenance | Made available in DSpace on 2023-03-19T22:21:14Z (GMT). No. of bitstreams: 1 U0001-0709202223252900.pdf: 5121711 bytes, checksum: 71d538e63c6a447df93bbc1ce5b40bca (MD5) Previous issue date: 2022 | en |
dc.description.tableofcontents | 摘要 i; Abstract ii; 1 Introduction 1; 1.1 Research motivation 1; 1.2 Contributions 2; 1.3 Chapter overview 2; 2 Previous Work 4; 2.1 Background of voice generation 4; 2.1.1 Conversion 4; 2.1.2 Synthesis 9; 2.2 Zero-shot voice adaptation 12; 2.2.1 Speaker encoder method 13; 2.2.2 Attention-based method 14; 2.3 Combining synthesis and conversion 15; 3 Method 17; 3.1 Source encoder 18; 3.2 Target encoder 20; 3.3 Decoder 21; 3.4 Extractor 21; 3.5 Discriminator 23; 3.6 Two-phase training 24; 3.7 Inference 25; 3.8 Vocoder 26; 4 Datasets 29; 4.1 MPOP600 29; 4.2 NUS-48E 29; 4.3 VCTK 30; 4.4 MUSDB18 30; 4.5 Dataset preprocessing 30; 5 Experimental Setup 32; 5.1 Implementation detail 32; 5.2 Evaluation method 33; 5.2.1 Test scenarios 33; 5.2.2 Objective evaluation 33; 5.2.3 Subjective evaluation 33; 5.3 Models for comparison 34; 5.3.1 FragmentSV 34; 5.3.2 AutoSVC 35; 5.3.3 DeepSinger 36; 5.4 Roadmap for experiments 36; 6 Results and Discussion 38; 6.1 Comparison between different models 38; 6.2 MOS test 39; 6.3 Effectiveness on different datasets 42; 6.4 Ablation study of Parallel WaveGAN STFT loss 43; 6.5 Different methods of processing singer information 44; 6.6 Source encoder analysis 46; 6.6.1 Ablation study on MAS 46; 6.6.2 Comparison between different types of source encoders for SVC 49; 7 Conclusions and Future Work 52; 7.1 Conclusions 52; 7.2 Future Work 52; Bibliography 54 | |
dc.language.iso | en | |
dc.title | 零樣本歌聲轉換與合成的統一模型 | zh_TW |
dc.title | A Unified Model for Zero-Shot Singing Voice Conversion and Synthesis | en |
dc.type | Thesis | |
dc.date.schoolyear | 110-2 | |
dc.description.degree | 碩士 (Master's) | |
dc.contributor.coadvisor | 張智星(Jyh-Shing Jang) | |
dc.contributor.oralexamcommittee | 楊奕軒(Yi-Hsuan Yang),王崇喆(Chung-Che Wang) | |
dc.subject.keyword | 歌聲轉換,歌聲合成,零樣本學習,自督導式學習, | zh_TW |
dc.subject.keyword | singing voice conversion, singing voice synthesis, zero-shot learning, self-supervised learning | en |
dc.relation.page | 60 | |
dc.identifier.doi | 10.6342/NTU202203241 | |
dc.rights.note | 同意授權(限校園內公開) / Authorized for access within the campus network only | |
dc.date.accepted | 2022-09-08 | |
dc.contributor.author-college | 電機資訊學院 | zh_TW |
dc.contributor.author-dept | 資料科學學位學程 | zh_TW |
dc.date.embargo-lift | 2022-09-14 | - |
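The abstract above describes a phonetic source encoder (for text input) and an acoustic source encoder (for audio input) that map into a shared, singer-agnostic latent space, which a single attention-based decoder combines with frame-level target-singer representations to serve both SVC and SVS. The snippet below is only a minimal, hypothetical PyTorch sketch of that idea; every class name, layer choice, and dimension is an illustrative assumption, not the thesis's actual implementation.

```python
# Hypothetical sketch of the dual-encoder / shared-decoder idea from the abstract.
# All names, layers, and dimensions are illustrative assumptions, not the thesis code.
import torch
import torch.nn as nn


class PhoneticEncoder(nn.Module):
    """Maps a phoneme (text) sequence into the shared latent space."""
    def __init__(self, n_phonemes=80, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, d_model)
        self.rnn = nn.GRU(d_model, d_model // 2, batch_first=True, bidirectional=True)

    def forward(self, phonemes):                 # (B, T_text) int64
        out, _ = self.rnn(self.embed(phonemes))
        return out                                # (B, T_text, d_model)


class AcousticEncoder(nn.Module):
    """Maps mel-spectrogram frames into the same latent space."""
    def __init__(self, n_mels=80, d_model=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(n_mels, d_model), nn.ReLU())
        self.rnn = nn.GRU(d_model, d_model // 2, batch_first=True, bidirectional=True)

    def forward(self, mel):                       # (B, T_frames, n_mels)
        out, _ = self.rnn(self.proj(mel))
        return out                                # (B, T_frames, d_model)


class Decoder(nn.Module):
    """Shared decoder: attends from source latents to frame-level
    target-singer representations and predicts output mel frames."""
    def __init__(self, n_mels=80, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(d_model, n_mels)

    def forward(self, source_latents, target_frames):
        fused, _ = self.attn(source_latents, target_frames, target_frames)
        return self.out(fused + source_latents)   # residual fusion, (B, T, n_mels)


class UnifiedSVCSVS(nn.Module):
    """Either source encoder can feed the shared decoder, so one model
    covers conversion (audio source) and synthesis (text source)."""
    def __init__(self):
        super().__init__()
        self.phonetic_enc = PhoneticEncoder()
        self.acoustic_enc = AcousticEncoder()
        self.target_enc = AcousticEncoder()       # encodes the reference singer's audio
        self.decoder = Decoder()

    def forward(self, source, target_mel, source_is_text):
        latents = self.phonetic_enc(source) if source_is_text else self.acoustic_enc(source)
        return self.decoder(latents, self.target_enc(target_mel))
```

Per the table of contents, the actual system additionally learns the audio-to-phoneme alignment (see the MAS ablation in Section 6.6.1), uses a discriminator and two-phase training, and renders the predicted mel spectrogram with a Parallel WaveGAN vocoder; none of that machinery is reproduced in this sketch.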
Appears in Collections: | 資料科學學位學程
Files in This Item:
File | Size | Format | |
---|---|---|---|
U0001-0709202223252900.pdf (access restricted to NTU campus IP addresses; use the VPN service for off-campus access) | 5 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.