Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/77780

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 張智星(Jyh-Shing Roger Jang) | |
| dc.contributor.author | Cheng-Wei Wu | en |
| dc.contributor.author | 巫承威 | zh_TW |
| dc.date.accessioned | 2021-07-11T14:34:43Z | - |
| dc.date.available | 2023-07-23 | |
| dc.date.copyright | 2018-07-23 | |
| dc.date.issued | 2018 | |
| dc.date.submitted | 2018-07-20 | |
| dc.identifier.citation | [1] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2017.
[2] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proc. IEEE Int. Conf. Computer Vision, 2017.
[3] Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.
[4] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
[6] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Proc. Int. Conf. Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[7] Xiaojiao Mao, Chunhua Shen, and Yu-Bin Yang. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In Advances in Neural Information Processing Systems, pages 2802–2810, 2016.
[8] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv e-prints, July 2016.
[9] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 448–456, Lille, France, 07–09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/ioffe15.html.
[10] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, page 3, 2013.
[11] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.
[12] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. In Advances in Neural Information Processing Systems, pages 972–981, 2017.
[13] Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proc. Conf. Empirical Methods in Natural Language Processing, pages 1724–1734, 2014.
[14] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, 2015.
[16] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.
[17] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 2414–2423. IEEE, 2016.
[18] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[19] Masanobu Abe, Satoshi Nakamura, Kiyohiro Shikano, and Hisao Kuwabara. Voice conversion through vector quantization. Journal of the Acoustical Society of Japan (E), 11(2):71–76, 1990.
[20] Alexander Kain and Michael W. Macon. Spectral voice conversion for text-to-speech synthesis. In Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on, volume 1, pages 285–288. IEEE, 1998.
[21] Yannis Stylianou, Olivier Cappé, and Eric Moulines. Continuous probabilistic transform for voice conversion. IEEE Transactions on Speech and Audio Processing, 6(2):131–142, 1998.
[22] Tomoki Toda, Hiroshi Saruwatari, and Kiyohiro Shikano. Voice conversion algorithm based on Gaussian mixture model with dynamic frequency warping of STRAIGHT spectrum. In Acoustics, Speech, and Signal Processing, 2001. Proceedings. (ICASSP'01). 2001 IEEE International Conference on, volume 2, pages 841–844. IEEE, 2001.
[23] Srinivas Desai, E. Veera Raghavendra, B. Yegnanarayana, Alan W. Black, and Kishore Prahallad. Voice conversion using artificial neural networks. In Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, pages 3893–3896. IEEE, 2009.
[24] Srinivas Desai, Alan W. Black, B. Yegnanarayana, and Kishore Prahallad. Spectral mapping using artificial neural networks for voice conversion. IEEE Transactions on Audio, Speech, and Language Processing, 18(5):954–964, 2010.
[25] Ling-Hui Chen, Zhen-Hua Ling, Li-Juan Liu, and Li-Rong Dai. Voice conversion using deep neural networks with layer-wise generative training. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 22(12):1859–1872, 2014.
[26] Takuhiro Kaneko and Hirokazu Kameoka. Parallel-data-free voice conversion using cycle-consistent adversarial networks. arXiv preprint arXiv:1711.11293, 2017.
[27] Tetsuya Hashimoto, Hidetsugu Uchida, Daisuke Saito, and Nobuaki Minematsu. Parallel-data-free many-to-many voice conversion based on DNN integrated with eigenspace using a non-parallel speech corpus. Proc. Interspeech 2017, pages 1278–1282, 2017.
[28] Ju-chieh Chou, Cheng-chieh Yeh, Hung-yi Lee, and Lin-shan Lee. Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations. arXiv preprint arXiv:1804.02812, 2018.
[29] Ehsan Hosseini-Asl, Yingbo Zhou, Caiming Xiong, and Richard Socher. A multi-discriminator CycleGAN for unsupervised non-parallel speech domain adaptation. arXiv preprint arXiv:1804.00522, 2018.
[30] Pedro Cano, Alex Loscos, Jordi Bonada, Maarten De Boer, and Xavier Serra. Voice morphing system for impersonating in karaoke applications. In ICMC, 2000.
[31] Anders Eriksson, C. Llamas, and D. Watt. The disguised voice: imitating accents or speech styles and impersonating individuals. Language and Identities, 8:86–96, 2010.
[32] Yang Gao, Rita Singh, and Bhiksha Raj. Voice impersonation using generative adversarial networks. arXiv preprint arXiv:1802.06840, 2018.
[33] Yukara Ikemiya, Katsutoshi Itoyama, and Hiroshi G. Okuno. Transferring vocal expression of F0 contour using singing voice synthesizer. In International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, pages 250–259. Springer, 2014.
[34] D. Ulyanov and V. Lebedev. Singing style transfer, 2016. URL https://dmitryulyanov.github.io/audio-texture-synthesis-and-style-transfer/.
[35] O. B. Bohan. Singing style transfer, 2017. URL http://madebyoll.in/posts/singing_style_transfer/.
[36] Eric Grinstein, Ngoc Duong, Alexey Ozerov, and Patrick Perez. Audio style transfer. arXiv preprint arXiv:1710.11385, 2017.
[37] David Berthelot, Tom Schumm, and Luke Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
[38] Chuan Li and Michael Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In Proc. European Conf. Computer Vision, pages 702–716. Springer, 2016.
[39] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4681–4690, 2017.
[40] Tak-Shing Chan, Tzu-Chun Yeh, Zhe-Cheng Fan, Hung-Wei Chen, Li Su, Yi-Hsuan Yang, and Roger Jang. Vocal activity informed singing voice separation with the iKala dataset. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, pages 718–722, 2015.
[41] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[42] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2536–2544, 2016.
[43] Richard W. Brislin. Back-translation for cross-cultural research. Journal of Cross-Cultural Psychology, 1(3):185–216, 1970.
[44] Yann LeCun, Sumit Chopra, Raia Hadsell, M. Ranzato, and F. Huang. A tutorial on energy-based learning. Predicting Structured Data, 1(0), 2006.
[45] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
[46] Cédric Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.
[47] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[48] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.
[49] Stéfan van der Walt, S. Chris Colbert, and Gael Varoquaux. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22–30, 2011.
[50] Yann LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–50. Springer, 1998.
[51] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[52] Daniel Griffin and Jae Lim. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, 1984.
[53] Andreas Jansson, Eric J. Humphrey, Nicola Montecchio, Rachel Bittner, Aparna Kumar, and Tillman Weyde. Singing voice separation with deep U-Net convolutional networks. In Proc. Int. Soc. Music Information Retrieval Conf., pages 745–751, 2017.
[54] Fabian-Robert Stöter, Antoine Liutkus, and Nobutaka Ito. The 2018 signal separation evaluation campaign. arXiv preprint arXiv:1804.06267, 2018. | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/77780 | - |
| dc.description.abstract | 本篇論文聚焦在歌唱人聲轉換,並借鑑以產生高解析度輸出著稱的生成式對抗網路來作為模型基底架構,期望能在沒有成對資料的狀況下將不同歌手的歌唱風格進行轉換,並同時擁有高解析度與逼真的人聲。在本篇論文中,我們透過將邊界平衡生成式對抗網路的訓練方法引入循環一致生成式對抗網路中來穩定訓練過程。而在模型架構部分,我們加入了對稱式跳躍連接 (Symmetric Skip-Connection),讓轉換後的人聲接近自然人聲並同時擁有高解析度。對於時間資訊,我們則在神經網路輸出層前加入閘門遞迴單元 (Gated Recurrent Unit, GRU),不僅進一步提升相對音高的準確度,更增加了整體歌唱人聲輸出的品質。為了驗證本論文所提出的訓練方法以及模型架構,我們將訓練方法與模型架構分別拆解,並進行以內部測試 (Inside Test) 為基準的平均意見分數 (MOS, Mean Opinion Score) 之問卷調查。根據實驗結果顯示,本論文提出的訓練方法以及模型架構不僅能顯著地轉換不同歌手間的歌唱風格,亦能產生具有高解析度且真實的歌唱人聲。 | zh_TW |
| dc.description.abstract | This thesis focuses on singing style transfer and aims to generate natural, high-resolution vocal sound while transferring the singing style of a given singer to a target singer, without paired data, using generative adversarial networks, which are known for synthesizing high-resolution output. In this work, we integrate Boundary Equilibrium Generative Adversarial Networks with Cycle-Consistent Generative Adversarial Networks to stabilize the training procedure. For the model architecture, we add symmetric skip-connections to make the transferred vocal more natural. To account for temporal information, we add GRU units before the output layer of the network, which not only improves the accuracy of the relative pitch but also enhances the overall quality of the singing-voice outputs. To validate the proposed training strategy and model architecture, we evaluate the two separately and conduct an inside-test-based subjective evaluation via MOS (Mean Opinion Score). According to the results of the subjective evaluation, our proposed training strategy and model architecture can effectively transfer the singing style between different singers and generate natural, high-resolution singing voices. | en |
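The boundary-equilibrium training strategy described in the abstract can be sketched as follows. This is an illustrative reading of the BEGAN update rule from reference [37], not the thesis's actual code; the hyper-parameter values (`gamma`, `lambda_k`) and function names are assumptions chosen for the example.

```python
# Illustrative BEGAN boundary-equilibrium sketch (not the thesis's code).
# loss_real / loss_fake stand in for the discriminator-autoencoder's
# reconstruction losses L(x) and L(G(z)); all numeric values are toy examples.

def began_update_k(k, loss_real, loss_fake, gamma=0.5, lambda_k=0.001):
    """One step of the equilibrium term: k <- k + lambda_k * (gamma*L(x) - L(G(z)))."""
    k = k + lambda_k * (gamma * loss_real - loss_fake)
    return min(max(k, 0.0), 1.0)  # k is kept in [0, 1]

def began_convergence(loss_real, loss_fake, gamma=0.5):
    """Global convergence measure: M = L(x) + |gamma*L(x) - L(G(z))|."""
    return loss_real + abs(gamma * loss_real - loss_fake)

# Example step: starting from k = 0, one update with toy losses.
k = began_update_k(0.0, loss_real=1.0, loss_fake=0.2)
m = began_convergence(loss_real=1.0, loss_fake=0.2)
```

The `k` term balances how much the discriminator focuses on real versus generated samples, and `M` gives a scalar to monitor training stability, which is the property the thesis leverages to stabilize CycleGAN training.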
| dc.description.provenance | Made available in DSpace on 2021-07-11T14:34:43Z (GMT). No. of bitstreams: 1 ntu-107-R05944004-1.pdf: 14670770 bytes, checksum: f7ad3b41f219ddd886739f4bd2ddb318 (MD5) Previous issue date: 2018 | en |
| dc.description.tableofcontents | 口試委員會審定書 (Committee Approval Certificate) i
致謝 (Acknowledgements) ii
中文摘要 (Chinese Abstract) iv
Abstract v
Contents vi
List of Figures viii
1 Introduction 1
1.1 Motivation 1
1.2 Problem Description 2
1.3 Main Contribution 3
1.4 Thesis Organization 3
2 Related Work 5
2.1 Image-to-Image Translation 5
2.1.1 Image-to-Image Translation with Conditional Adversarial Networks 6
2.1.2 Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks 7
2.2 Training Framework of Generative Adversarial Networks 9
2.2.1 Energy-based Generative Adversarial Networks 9
2.2.2 BEGAN: Boundary Equilibrium Generative Adversarial Networks 11
3 Proposed Model 15
3.1 Data Preprocessing 15
3.2 Training Strategy 16
3.2.1 Cycle Consistency 16
3.2.2 Boundary Equilibrium 17
3.3 Objectives 18
3.4 Network Architecture 19
3.5 Implementation Details 21
4 Experimental Results 23
4.1 Dataset 23
4.2 Ablation Experiments 24
4.2.1 Hyper-Parameters 26
4.2.2 Initialization Method 26
4.2.3 U-Net Structure 27
4.2.4 Architecture of Convolutional Neural Network 30
4.2.5 Normalization Layer 30
4.2.6 Activation Function 37
4.2.7 Usage of Recurrent Layers 38
4.2.8 Architecture of Recurrent Layers 42
4.2.9 Algorithms of Gradient Descent 42
4.2.10 Window Size & Hop Size of Short-Time Fourier Transform 44
4.3 Subjective Evaluation 47
5 Conclusion & Future Work 49
References 50 | |
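Section 4.2.10 of the table of contents ablates the window size and hop size of the short-time Fourier transform used in preprocessing. A minimal NumPy sketch of that computation is given below; the signal, sample rate, and parameter values are arbitrary examples, not the thesis's actual settings.

```python
import numpy as np

# Minimal STFT magnitude sketch (illustrative; parameters are examples only).
def stft_magnitude(signal, win_size=1024, hop_size=256):
    """Return |STFT| with a Hann window; shape = (n_frames, win_size // 2 + 1)."""
    window = np.hanning(win_size)
    n_frames = 1 + (len(signal) - win_size) // hop_size
    frames = np.stack([
        signal[i * hop_size : i * hop_size + win_size] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequency bins of the real signal
    return np.abs(np.fft.rfft(frames, axis=1))

# Example: one second of a 440 Hz sine at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
mag = stft_magnitude(np.sin(2 * np.pi * 440 * t))
```

A larger window gives finer frequency resolution but coarser time resolution, and a smaller hop gives more frames at higher cost, which is the trade-off such an ablation explores.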
| dc.language.iso | en | |
| dc.subject | 歌唱人聲轉換 | zh_TW |
| dc.subject | 生成式對抗網路 | zh_TW |
| dc.subject | Generative Adversarial Networks | en |
| dc.subject | Singing Style Transfer | en |
| dc.title | 基於循環一致邊界平衡生成式對抗網路之歌唱風格轉換 | zh_TW |
| dc.title | Singing Style Transfer Using Cycle-Consistent Boundary Equilibrium Generative Adversarial Networks | en |
| dc.type | Thesis | |
| dc.date.schoolyear | 106-2 | |
| dc.description.degree | 碩士 (Master) | |
| dc.contributor.coadvisor | 楊奕軒(Yi-Hsuan Yang) | |
| dc.contributor.oralexamcommittee | 李宏毅(Hung-Yi Lee) | |
| dc.subject.keyword | 生成式對抗網路, 歌唱人聲轉換 | zh_TW |
| dc.subject.keyword | Generative Adversarial Networks, Singing Style Transfer | en |
| dc.relation.page | 56 | |
| dc.identifier.doi | 10.6342/NTU201801003 | |
| dc.rights.note | 有償授權 (paid authorization) | |
| dc.date.accepted | 2018-07-20 | |
| dc.contributor.author-college | 電機資訊學院 | zh_TW |
| dc.contributor.author-dept | 資訊網路與多媒體研究所 | zh_TW |
| dc.date.embargo-lift | 2023-07-23 | - |
| Appears in Collections: | Graduate Institute of Networking and Multimedia (資訊網路與多媒體研究所) | |
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-107-R05944004-1.pdf (restricted access) | 14.33 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
