Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/77780

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 張智星(Jyh-Shing Roger Jang) | |
| dc.contributor.author | Cheng-Wei Wu | en |
| dc.contributor.author | 巫承威 | zh_TW |
| dc.date.accessioned | 2021-07-11T14:34:43Z | - |
| dc.date.available | 2023-07-23 | |
| dc.date.copyright | 2018-07-23 | |
| dc.date.issued | 2018 | |
| dc.date.submitted | 2018-07-20 | |
| dc.identifier.citation | [1] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2017.
[2] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proc. IEEE Int. Conf. Computer Vision, 2017.
[3] Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.
[4] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
[6] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Proc. Int. Conf. Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[7] Xiaojiao Mao, Chunhua Shen, and Yu-Bin Yang. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In Advances in Neural Information Processing Systems, pages 2802–2810, 2016.
[8] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv e-prints, July 2016.
[9] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 448–456, Lille, France, 07–09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/ioffe15.html.
[10] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, page 3, 2013.
[11] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.
[12] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. In Advances in Neural Information Processing Systems, pages 972–981, 2017.
[13] Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proc. Conf. Empirical Methods in Natural Language Processing, pages 1724–1734, 2014.
[14] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, 2015.
[16] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.
[17] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 2414–2423. IEEE, 2016.
[18] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[19] Masanobu Abe, Satoshi Nakamura, Kiyohiro Shikano, and Hisao Kuwabara. Voice conversion through vector quantization. Journal of the Acoustical Society of Japan (E), 11(2):71–76, 1990.
[20] Alexander Kain and Michael W. Macon. Spectral voice conversion for text-to-speech synthesis. In Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on, volume 1, pages 285–288. IEEE, 1998.
[21] Yannis Stylianou, Olivier Cappé, and Eric Moulines. Continuous probabilistic transform for voice conversion. IEEE Transactions on Speech and Audio Processing, 6(2):131–142, 1998.
[22] Tomoki Toda, Hiroshi Saruwatari, and Kiyohiro Shikano. Voice conversion algorithm based on Gaussian mixture model with dynamic frequency warping of STRAIGHT spectrum. In Acoustics, Speech, and Signal Processing, 2001. Proceedings. (ICASSP'01). 2001 IEEE International Conference on, volume 2, pages 841–844. IEEE, 2001.
[23] Srinivas Desai, E. Veera Raghavendra, B. Yegnanarayana, Alan W. Black, and Kishore Prahallad. Voice conversion using artificial neural networks. In Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, pages 3893–3896. IEEE, 2009.
[24] Srinivas Desai, Alan W. Black, B. Yegnanarayana, and Kishore Prahallad. Spectral mapping using artificial neural networks for voice conversion. IEEE Transactions on Audio, Speech, and Language Processing, 18(5):954–964, 2010.
[25] Ling-Hui Chen, Zhen-Hua Ling, Li-Juan Liu, and Li-Rong Dai. Voice conversion using deep neural networks with layer-wise generative training. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 22(12):1859–1872, 2014.
[26] Takuhiro Kaneko and Hirokazu Kameoka. Parallel-data-free voice conversion using cycle-consistent adversarial networks. arXiv preprint arXiv:1711.11293, 2017.
[27] Tetsuya Hashimoto, Hidetsugu Uchida, Daisuke Saito, and Nobuaki Minematsu. Parallel-data-free many-to-many voice conversion based on DNN integrated with eigenspace using a non-parallel speech corpus. Proc. Interspeech 2017, pages 1278–1282, 2017.
[28] Ju-chieh Chou, Cheng-chieh Yeh, Hung-yi Lee, and Lin-shan Lee. Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations. arXiv preprint arXiv:1804.02812, 2018.
[29] Ehsan Hosseini-Asl, Yingbo Zhou, Caiming Xiong, and Richard Socher. A multi-discriminator CycleGAN for unsupervised non-parallel speech domain adaptation. arXiv preprint arXiv:1804.00522, 2018.
[30] Pedro Cano, Alex Loscos, Jordi Bonada, Maarten De Boer, and Xavier Serra. Voice morphing system for impersonating in karaoke applications. In ICMC, 2000.
[31] Anders Eriksson, C. Llamas, and D. Watt. The disguised voice: imitating accents or speech styles and impersonating individuals. Language and Identities, 8:86–96, 2010.
[32] Yang Gao, Rita Singh, and Bhiksha Raj. Voice impersonation using generative adversarial networks. arXiv preprint arXiv:1802.06840, 2018.
[33] Yukara Ikemiya, Katsutoshi Itoyama, and Hiroshi G. Okuno. Transferring vocal expression of F0 contour using singing voice synthesizer. In International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, pages 250–259. Springer, 2014.
[34] D. Ulyanov and V. Lebedev. Singing style transfer, 2016. URL https://dmitryulyanov.github.io/audio-texture-synthesis-and-style-transfer/.
[35] O. B. Bohan. Singing style transfer, 2017. URL http://madebyoll.in/posts/singing_style_transfer/.
[36] Eric Grinstein, Ngoc Duong, Alexey Ozerov, and Patrick Perez. Audio style transfer. arXiv preprint arXiv:1710.11385, 2017.
[37] David Berthelot, Tom Schumm, and Luke Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
[38] Chuan Li and Michael Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In Proc. European Conf. Computer Vision, pages 702–716. Springer, 2016.
[39] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4681–4690, 2017.
[40] Tak-Shing Chan, Tzu-Chun Yeh, Zhe-Cheng Fan, Hung-Wei Chen, Li Su, Yi-Hsuan Yang, and Roger Jang. Vocal activity informed singing voice separation with the iKala dataset. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, pages 718–722, 2015.
[41] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[42] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2536–2544, 2016.
[43] Richard W. Brislin. Back-translation for cross-cultural research. Journal of Cross-Cultural Psychology, 1(3):185–216, 1970.
[44] Yann LeCun, Sumit Chopra, Raia Hadsell, M. Ranzato, and F. Huang. A tutorial on energy-based learning. Predicting Structured Data, 1(0), 2006.
[45] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
[46] Cédric Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.
[47] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[48] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.
[49] Stéfan van der Walt, S. Chris Colbert, and Gael Varoquaux. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22–30, 2011.
[50] Yann LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–50. Springer, 1998.
[51] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[52] Daniel Griffin and Jae Lim. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, 1984.
[53] Andreas Jansson, Eric J. Humphrey, Nicola Montecchio, Rachel Bittner, Aparna Kumar, and Tillman Weyde. Singing voice separation with deep U-Net convolutional networks. In Proc. Int. Soc. Music Information Retrieval Conf., pages 745–751, 2017.
[54] Fabian-Robert Stöter, Antoine Liutkus, and Nobutaka Ito. The 2018 signal separation evaluation campaign. arXiv preprint arXiv:1804.06267, 2018. | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/77780 | - |
| dc.description.abstract | 本篇論文聚焦在歌唱人聲轉換,並借鑑以產生高解析度輸出著稱的生成式對抗網路來作為模型基底架構,期望能在沒有成對資料的狀況下將不同歌手的歌唱風格進行轉換,並同時擁有高解析度與逼真的人聲。在本篇論文中,我們透過將邊界平衡生成式對抗網路的訓練方法引入循環一致生成式對抗網路中來穩定訓練過程。而在模型架構部分,我們加入了對稱式跳躍連接 (Symmetric Skip-Connection),讓轉換後的人聲接近自然人聲並同時擁有高解析度。對於時間資訊,我們則在神經網路輸出層前加入閘門遞迴單元 (Gated Recurrent Unit, GRU),不僅進一步提升相對音高的準確度,更增加了整體歌唱人聲輸出的品質。為了驗證本論文所提出的訓練方法以及模型架構,我們將訓練方法與模型架構分別拆解,並進行以內部測試 (Inside Test) 為基準的平均意見分數 (MOS, Mean Opinion Score) 之問卷調查。根據實驗結果顯示,本論文提出的訓練方法以及模型架構不僅能顯著地轉換不同歌手間的歌唱風格,亦能產生具有高解析度且真實的歌唱人聲。 | zh_TW |
| dc.description.abstract | This thesis focuses on singing style transfer and aims to generate natural, high-resolution vocal sound while transferring the singing style of a given singer to a target singer, without paired data, using generative adversarial networks, which are known for synthesizing high-resolution output. In this work, we integrate Boundary Equilibrium Generative Adversarial Networks with Cycle-Consistent Generative Adversarial Networks to stabilize the training procedure. For the model architecture, we add symmetric skip-connections to make the transferred vocal more natural. To account for temporal information, we add GRU units before the output layer of the network, which not only improves the accuracy of the relative pitch but also enhances the overall quality of the singing-voice outputs. To validate the proposed training strategy and model architecture, we evaluate the two separately and conduct an inside-test-based subjective evaluation via MOS (Mean Opinion Score). According to the results of the subjective evaluation, our proposed training strategy and model architecture can effectively transfer the singing style between different singers and generate natural, high-resolution singing voices. | en |
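The boundary-equilibrium training strategy described in the abstract can be sketched as follows. This is an illustrative reading of the BEGAN update rule from reference [37], not the thesis's actual code; the hyper-parameter values (`gamma`, `lambda_k`) and function names are assumptions chosen for the example.

```python
# Illustrative BEGAN boundary-equilibrium sketch (not the thesis's code).
# loss_real / loss_fake stand in for the discriminator-autoencoder's
# reconstruction losses L(x) and L(G(z)); all numeric values are toy examples.

def began_update_k(k, loss_real, loss_fake, gamma=0.5, lambda_k=0.001):
    """One step of the equilibrium term: k <- k + lambda_k * (gamma*L(x) - L(G(z)))."""
    k = k + lambda_k * (gamma * loss_real - loss_fake)
    return min(max(k, 0.0), 1.0)  # k is kept in [0, 1]

def began_convergence(loss_real, loss_fake, gamma=0.5):
    """Global convergence measure: M = L(x) + |gamma*L(x) - L(G(z))|."""
    return loss_real + abs(gamma * loss_real - loss_fake)

# Example step: starting from k = 0, one update with toy losses.
k = began_update_k(0.0, loss_real=1.0, loss_fake=0.2)
m = began_convergence(loss_real=1.0, loss_fake=0.2)
```

The `k` term balances how much the discriminator focuses on real versus generated samples, and `M` gives a scalar to monitor training stability, which is the property the thesis leverages to stabilize CycleGAN training.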
| dc.description.provenance | Made available in DSpace on 2021-07-11T14:34:43Z (GMT). No. of bitstreams: 1 ntu-107-R05944004-1.pdf: 14670770 bytes, checksum: f7ad3b41f219ddd886739f4bd2ddb318 (MD5) Previous issue date: 2018 | en |
| dc.description.tableofcontents | 口試委員會審定書 (Committee Approval Certificate) i
致謝 (Acknowledgements) ii
中文摘要 (Chinese Abstract) iv
Abstract v
Contents vi
List of Figures viii
1 Introduction 1
1.1 Motivation 1
1.2 Problem Description 2
1.3 Main Contribution 3
1.4 Thesis Organization 3
2 Related Work 5
2.1 Image-to-Image Translation 5
2.1.1 Image-to-Image Translation with Conditional Adversarial Networks 6
2.1.2 Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks 7
2.2 Training Framework of Generative Adversarial Networks 9
2.2.1 Energy-based Generative Adversarial Networks 9
2.2.2 BEGAN: Boundary Equilibrium Generative Adversarial Networks 11
3 Proposed Model 15
3.1 Data Preprocessing 15
3.2 Training Strategy 16
3.2.1 Cycle Consistency 16
3.2.2 Boundary Equilibrium 17
3.3 Objectives 18
3.4 Network Architecture 19
3.5 Implementation Details 21
4 Experimental Results 23
4.1 Dataset 23
4.2 Ablation Experiments 24
4.2.1 Hyper-Parameters 26
4.2.2 Initialization Method 26
4.2.3 U-Net Structure 27
4.2.4 Architecture of Convolutional Neural Network 30
4.2.5 Normalization Layer 30
4.2.6 Activation Function 37
4.2.7 Usage of Recurrent Layers 38
4.2.8 Architecture of Recurrent Layers 42
4.2.9 Algorithms of Gradient Descent 42
4.2.10 Window Size & Hop Size of Short-Time Fourier Transform 44
4.3 Subjective Evaluation 47
5 Conclusion & Future Work 49
References 50 | |
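Section 4.2.10 of the table of contents ablates the window size and hop size of the short-time Fourier transform used in preprocessing. A minimal NumPy sketch of that computation is given below; the signal, sample rate, and parameter values are arbitrary examples, not the thesis's actual settings.

```python
import numpy as np

# Minimal STFT magnitude sketch (illustrative; parameters are examples only).
def stft_magnitude(signal, win_size=1024, hop_size=256):
    """Return |STFT| with a Hann window; shape = (n_frames, win_size // 2 + 1)."""
    window = np.hanning(win_size)
    n_frames = 1 + (len(signal) - win_size) // hop_size
    frames = np.stack([
        signal[i * hop_size : i * hop_size + win_size] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequency bins of the real signal
    return np.abs(np.fft.rfft(frames, axis=1))

# Example: one second of a 440 Hz sine at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
mag = stft_magnitude(np.sin(2 * np.pi * 440 * t))
```

A larger window gives finer frequency resolution but coarser time resolution, and a smaller hop gives more frames at higher cost, which is the trade-off such an ablation explores.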
| dc.language.iso | en | |
| dc.subject | 歌唱人聲轉換 | zh_TW |
| dc.subject | 生成式對抗網路 | zh_TW |
| dc.subject | Generative Adversarial Networks | en |
| dc.subject | Singing Style Transfer | en |
| dc.title | 基於循環一致邊界平衡生成式對抗網路之歌唱風格轉換 | zh_TW |
| dc.title | Singing Style Transfer Using Cycle-Consistent Boundary Equilibrium Generative Adversarial Networks | en |
| dc.type | Thesis | |
| dc.date.schoolyear | 106-2 | |
| dc.description.degree | 碩士 (Master) | |
| dc.contributor.coadvisor | 楊奕軒(Yi-Hsuan Yang) | |
| dc.contributor.oralexamcommittee | 李宏毅(Hung-Yi Lee) | |
| dc.subject.keyword | 生成式對抗網路, 歌唱人聲轉換 | zh_TW |
| dc.subject.keyword | Generative Adversarial Networks, Singing Style Transfer | en |
| dc.relation.page | 56 | |
| dc.identifier.doi | 10.6342/NTU201801003 | |
| dc.rights.note | 有償授權 (paid authorization) | |
| dc.date.accepted | 2018-07-20 | |
| dc.contributor.author-college | 電機資訊學院 | zh_TW |
| dc.contributor.author-dept | 資訊網路與多媒體研究所 | zh_TW |
| dc.date.embargo-lift | 2023-07-23 | - |
| Appears in Collections: | Graduate Institute of Networking and Multimedia (資訊網路與多媒體研究所) | |
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-107-R05944004-1.pdf (restricted access) | 14.33 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
