Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/55505

Full metadata record (DC field: value [language])
dc.contributor.advisor: 李琳山 (Lin-shan Lee)
dc.contributor.author: Ju-Chieh Chou [en]
dc.contributor.author: 周儒杰 [zh_TW]
dc.date.accessioned: 2021-06-16T04:06:16Z
dc.date.available: 2020-09-02
dc.date.copyright: 2020-09-02
dc.date.issued: 2019
dc.date.submitted: 2020-08-13
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/55505
dc.description.abstract: The goal of voice conversion is to preserve the linguistic content of speech while converting the speaker to another target speaker. Because such tasks often face the problem that parallel corpora are difficult to collect, much research has been devoted to voice conversion with non-parallel corpora.
This thesis first proposes extending non-parallel voice conversion from a single target speaker to the multi-target scenario, that is, a single model able to convert to multiple target speakers. The model is designed around the concept of separately embedding linguistic content and speaker information. Through an adversarial training framework, the model learns to generate speaker-invariant representations, i.e. representations containing only linguistic content, and converting to a target speaker is achieved by changing the speaker vector. We also propose a method based on generative adversarial networks (GANs) to address the over-smoothing of the generated speech. In subjective tests, the model achieves results comparable to a baseline based on cycle-consistent generative adversarial networks (CycleGANs) in terms of both naturalness and similarity to the target speaker, even though that baseline only applies to the single-target-speaker scenario.
In addition, we propose a one-shot voice conversion model. At the inference stage, given only one utterance from the source speaker and one utterance from the target speaker, the model can "speak" the content of the source utterance in the target speaker's voice. By introducing instance normalization in the content encoder of a variational autoencoder (VAE) and adaptive instance normalization in the decoder, the model learns to embed speaker information and content information separately, which enables one-shot voice conversion. In the similarity evaluation, the model generates speech that resembles the target speaker.
[zh_TW]
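The adversarial disentanglement described above (a content encoder trained so that its output carries no speaker information, plus a learned speaker embedding that conditions the decoder) can be illustrated with a toy sketch. The code below is a hypothetical PyTorch-style example: the module definitions, dimensions, optimizer settings, and loss weight are illustrative assumptions and do not reproduce the thesis's actual architecture, which also includes autoencoder pre-training and a GAN stage for sharpening the output.

```python
# Minimal sketch (assumed PyTorch) of adversarial disentanglement for
# multi-target voice conversion: a speaker classifier tries to identify the
# speaker from the content codes, while the encoder is trained to fool it,
# and a decoder reconstructs the spectrogram from (content, speaker embedding).
# All module names, sizes, and weights below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_SPEAKERS, N_MELS, D = 20, 80, 128            # assumed dimensions

class ContentEncoder(nn.Module):               # mel frames -> content codes
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(N_MELS, D), nn.ReLU(), nn.Linear(D, D))
    def forward(self, mel):                    # (batch, time, n_mels)
        return self.net(mel)

class Decoder(nn.Module):                      # (content, speaker emb) -> mel
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * D, D), nn.ReLU(), nn.Linear(D, N_MELS))
    def forward(self, content, spk):
        spk = spk.unsqueeze(1).expand(-1, content.size(1), -1)
        return self.net(torch.cat([content, spk], dim=-1))

class SpeakerClassifier(nn.Module):            # content codes -> speaker logits
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(D, N_SPEAKERS)
    def forward(self, content):
        return self.net(content.mean(dim=1))   # average over time

enc, dec, clf = ContentEncoder(), Decoder(), SpeakerClassifier()
spk_emb = nn.Embedding(N_SPEAKERS, D)          # one learned vector per speaker
opt_ae = torch.optim.Adam([*enc.parameters(), *dec.parameters(), *spk_emb.parameters()], lr=1e-4)
opt_clf = torch.optim.Adam(clf.parameters(), lr=1e-4)

def train_step(mel, speaker_id, adv_weight=0.01):
    content = enc(mel)
    # 1) Classifier learns to predict the speaker from the content codes.
    clf_loss = F.cross_entropy(clf(content.detach()), speaker_id)
    opt_clf.zero_grad(); clf_loss.backward(); opt_clf.step()
    # 2) Encoder/decoder reconstruct the input while fooling the classifier,
    #    pushing the content representation toward speaker invariance.
    recon = dec(content, spk_emb(speaker_id))
    ae_loss = F.l1_loss(recon, mel) - adv_weight * F.cross_entropy(clf(content), speaker_id)
    opt_ae.zero_grad(); ae_loss.backward(); opt_ae.step()
    return ae_loss.item()

# Conversion would decode a source utterance's content with the target
# speaker's embedding: dec(enc(source_mel), spk_emb(target_id)).
mel = torch.randn(4, 100, N_MELS)              # dummy batch: 4 utterances, 100 frames
speaker_id = torch.randint(0, N_SPEAKERS, (4,))
train_step(mel, speaker_id)
```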
dc.description.abstract: The goal of voice conversion is to keep the linguistic content of the speech signal while converting the speaker to another target speaker. This kind of task usually suffers from the problem that parallel corpora are difficult to collect, so many studies investigate how to perform voice conversion with non-parallel corpora.
This thesis first proposes extending non-parallel voice conversion from a single target speaker to multiple target speakers, in other words, using one model to convert to multiple target speakers. The model is designed around the concept of separately embedding linguistic content and speaker information. Through adversarial training, the model learns to generate speaker-invariant, content-only representations, and changing the speaker latent representation converts the speech to the target speaker. We also propose an approach that uses generative adversarial networks (GANs) to address the problem of over-smoothed speech. In subjective evaluations, this model achieves results comparable to the CycleGAN-based baseline model in terms of similarity to the target speaker and naturalness of speech, while additionally being able to perform multi-target speaker conversion.
In addition, we propose a one-shot voice conversion model. At the inference stage, only one utterance from the source speaker and one from the target speaker are needed; the model can then "speak" the content of the source utterance in the target speaker's voice. By introducing instance normalization in the content encoder of a variational autoencoder (VAE) and adaptive instance normalization in the decoder, the model learns to embed speaker information and content information separately, so that one-shot voice conversion is achieved. In the similarity evaluation, the model generates speech similar to the target speaker.
[en]
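The one-shot model's key mechanism, instance normalization (IN) in the content encoder and adaptive instance normalization (AdaIN) in the decoder, can be sketched directly. The snippet below is an illustrative PyTorch-style example under the assumption that features are shaped (batch, channels, time); the tensor shapes and function names are assumptions for exposition, not the thesis implementation.

```python
# Sketch (assumed PyTorch) of instance normalization (IN) and adaptive
# instance normalization (AdaIN) over feature maps shaped (batch, channels, time).
import torch

def instance_norm(x, eps=1e-5):
    # Remove per-channel, per-utterance statistics; applied inside the content
    # encoder, this discards channel statistics that tend to carry speaker /
    # global information, leaving mostly content.
    mean = x.mean(dim=2, keepdim=True)
    std = x.std(dim=2, keepdim=True)
    return (x - mean) / (std + eps)

def adaptive_instance_norm(content, gamma, beta, eps=1e-5):
    # Re-scale the normalized content features with per-channel statistics
    # (gamma, beta) predicted from a target-speaker utterance; this is how the
    # decoder injects speaker information.
    normalized = instance_norm(content, eps)
    return gamma.unsqueeze(2) * normalized + beta.unsqueeze(2)

# Toy usage with random tensors (shapes are assumptions):
content = torch.randn(1, 256, 120)   # content-encoder features of a source utterance
gamma = torch.randn(1, 256)          # per-channel scale from the speaker encoder
beta = torch.randn(1, 256)           # per-channel shift from the speaker encoder
styled = adaptive_instance_norm(content, gamma, beta)
print(styled.shape)                  # torch.Size([1, 256, 120])
```

The design intuition is that the per-channel statistics removed by IN tend to encode global, speaker-dependent characteristics, so supplying new statistics through AdaIN is what transfers the target speaker's characteristics onto the source content.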
dc.description.provenance: Made available in DSpace on 2021-06-16T04:06:16Z (GMT). No. of bitstreams: 1. U0001-2907202021200200.pdf: 7879201 bytes, checksum: 1e8ea41ab094545cc3d4bc1b38d85028 (MD5). Previous issue date: 2019. [en]
dc.description.tableofcontents: Contents
English Abstract
Chinese Abstract
1. Introduction
1.1 Research Motivation
1.2 Research Directions
1.3 Related Work
1.4 Main Contributions
1.5 Thesis Organization
2. Background Knowledge
2.1 Neural Networks
2.1.1 Neural Network Training
2.1.2 Recurrent Neural Networks
2.1.3 Convolutional Neural Networks
2.1.4 Autoencoders
2.2 Deep Generative Models
2.2.1 Variational Autoencoders
2.2.2 Generative Adversarial Networks
2.2.3 Disentangled Representation Learning with Adversarial Training
2.3 Voice Conversion
2.3.1 Voice Conversion with Parallel Corpora
2.3.2 Voice Conversion with Non-parallel Corpora
2.4 Chapter Summary
3. Multi-target Voice Conversion without Parallel Data
3.1 Introduction
3.2 Model Architecture and Training Procedure
3.2.1 Autoencoder Pre-training
3.2.2 Disentangled Representation Learning
3.2.3 Generative Adversarial Network
3.3 Network Architecture and Implementation
3.3.1 Model Components
3.3.2 Network Architecture
3.3.3 Implementation Details
3.4 Experiments
3.4.1 Experimental Setup
3.4.2 Objective Evaluation
3.4.3 Subjective Evaluation
3.5 Chapter Summary
4. One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization
4.1 Introduction
4.2 Proposed Method
4.2.1 Variational Autoencoder
4.2.2 Instance Normalization
4.3 Network Architecture and Implementation
4.3.1 Network Architecture
4.3.2 Training Details
4.3.3 Acoustic Features
4.4 Experiments
4.4.1 Experimental Setup
4.4.2 Analysis of the Degree of Disentanglement
4.4.3 Speaker Vector Visualization
4.4.4 Spectrograms
4.4.5 Objective Evaluation
4.4.6 Subjective Evaluation
4.5 Chapter Summary
5. Conclusion and Future Work
5.1 Main Contributions
5.2 Future Research Directions
References
dc.language.iso: zh-TW
dc.subject: 深層生成模型 [zh_TW]
dc.subject: 語音 [zh_TW]
dc.subject: 語音轉換 [zh_TW]
dc.subject: voice conversion [en]
dc.subject: deep generative model [en]
dc.subject: speech [en]
dc.title: 以分別嵌入語者及語言內容資訊之深層生成模型達成無監督式語音轉換 [zh_TW]
dc.title: Unsupervised Voice Conversion by Separately Embedding Speaker and Content Information with Deep Generative Model [en]
dc.type: Thesis
dc.date.schoolyear: 108-2
dc.description.degree: 碩士 (Master's)
dc.contributor.coadvisor: 李宏毅 (Hung-yi Lee)
dc.contributor.oralexamcommittee: 鄭秋豫 (Chiu-yu Tseng), 王小川 (Hsiao-Chuan Wang), 陳信宏 (Sin-Horng Chen)
dc.subject.keyword: 語音, 語音轉換, 深層生成模型 [zh_TW]
dc.subject.keyword: speech, voice conversion, deep generative model [en]
dc.relation.page: 76
dc.identifier.doi: 10.6342/NTU202002061
dc.rights.note: 有償授權 (paid authorization)
dc.date.accepted: 2020-08-14
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) [zh_TW]
dc.contributor.author-dept: 資訊工程學研究所 (Graduate Institute of Computer Science and Information Engineering) [zh_TW]
Appears in Collections: Department of Computer Science and Information Engineering (資訊工程學系)

Files in This Item:
File: U0001-2907202021200200.pdf (access restricted; not publicly available)
Size: 7.69 MB
Format: Adobe PDF


All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
