Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/71969
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 李宏毅 | |
dc.contributor.author | Chi-Yu Yang | en |
dc.contributor.author | 楊棋宇 | zh_TW |
dc.date.accessioned | 2021-06-17T06:17:07Z | - |
dc.date.available | 2019-08-22 | |
dc.date.copyright | 2018-08-22 | |
dc.date.issued | 2018 | |
dc.date.submitted | 2018-08-21 | |
dc.identifier.citation | [1] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[2] Tsung-Hsien Wen, David Vandyke, Nikola Mrksic, Milica Gasic, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young, "A network-based end-to-end trainable task-oriented dialogue system," arXiv preprint arXiv:1604.04562, 2016.
[3] Lih-Yuan Deng, "The cross-entropy method: a unified approach to combinatorial optimization, Monte Carlo simulation, and machine learning," 2006.
[4] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur, "Recurrent neural network based language model," in Eleventh Annual Conference of the International Speech Communication Association, 2010.
[5] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
[6] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[7] Yann LeCun, Yoshua Bengio, et al., "Convolutional networks for images, speech, and time series," The Handbook of Brain Theory and Neural Networks, vol. 3361, no. 10, p. 1995, 1995.
[8] Yao Qian, Frank Soong, Yining Chen, and Min Chu, "An HMM-based Mandarin Chinese text-to-speech system," in Chinese Spoken Language Processing, pp. 223–232, Springer, 2006.
[9] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
[10] Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C. Cobo, Florian Stimberg, et al., "Parallel WaveNet: Fast high-fidelity speech synthesis," arXiv preprint arXiv:1711.10433, 2017.
[11] Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron Courville, and Yoshua Bengio, "Char2Wav: End-to-end speech synthesis," 2017.
[12] Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua Bengio, "SampleRNN: An unconditional end-to-end neural audio generation model," arXiv preprint arXiv:1612.07837, 2016.
[13] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al., "Tacotron: Towards end-to-end speech synthesis," arXiv preprint arXiv:1703.10135, 2017.
[14] Daniel Griffin and Jae Lim, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, 1984.
[15] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[16] Sergey Ioffe and Christian Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[18] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber, "Highway networks," arXiv preprint arXiv:1505.00387, 2015.
[19] Keith Ito, "The LJ Speech Dataset," https://keithito.com/LJ-Speech-Dataset/, 2017.
[20] Hideyuki Tachibana, Katsuya Uenoyama, and Shunsuke Aihara, "Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention," arXiv preprint arXiv:1710.08969, 2017.
[21] R. Kubichek, "Mel-cepstral distance measure for objective speech quality assessment," in Communications, Computers and Signal Processing, 1993 IEEE Pacific Rim Conference on, IEEE, 1993, vol. 1, pp. 125–128.
[22] Alex Barron, "End-to-end neural speech synthesis."
[23] Yaniv Taigman, Lior Wolf, Adam Polyak, and Eliya Nachmani, "VoiceLoop: Voice fitting and synthesis via a phonological loop," in International Conference on Learning Representations (ICLR), 2018.
[24] RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J. Weiss, Rob Clark, and Rif A. Saurous, "Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron," arXiv preprint arXiv:1803.09047, 2018.
[25] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, IEEE, 2016, pp. 4960–4964.
[26] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[27] Martin Sundermeyer, Ralf Schlüter, and Hermann Ney, "LSTM neural networks for language modeling," in Thirteenth Annual Conference of the International Speech Communication Association, 2012.
[28] Christophe Veaux, Junichi Yamagishi, Kirsten MacDonald, et al., "CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit," 2017.
[29] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, "LibriSpeech: an ASR corpus based on public domain audio books," in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, IEEE, 2015, pp. 5206–5210. | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/71969 | - |
dc.description.abstract | This thesis focuses on speech synthesis with sequence-to-sequence models and on strengthening multi-speaker speech synthesis. As technology has evolved, smart devices have become part of everyday life and appear in all kinds of settings; people prefer to communicate with them through speech, which is more intuitive than typed input, and the devices respond with speech as well, so text-to-speech technology has become highly important. Conventional speech synthesis systems fall into two broad categories, concatenative synthesis and statistical parametric synthesis, while with the rapid recent development of neural networks, most speech synthesis is now realized with deep neural network models.
The Tacotron model used in this thesis is such a deep neural network model. Tacotron has attracted considerable attention in recent speech synthesis research and can synthesize speech of good quality, but most prior work has targeted English. This thesis first studies how text units of different granularity, used as input to an end-to-end Mandarin speech synthesis model, affect the quality of the synthesized speech, and adds guided attention to steer the model toward the correct positions in the text encoding during synthesis so that it learns the attention alignment quickly (a code sketch of this mechanism is given after the metadata record below). Next, Tacotron is used to build an end-to-end Mandarin-text-to-Taiwanese (Southern Min)-speech synthesis system: even when the target language has no standard written form, end-to-end learning can exploit the correspondence between source-language text and target-language speech, synthesizing target-language speech directly from source-language text; scheduled sampling is also added in an attempt to improve the otherwise poor synthesis quality. Finally, a Tacotron model augmented with a reference-audio encoder implements a multi-speaker speech synthesis system, and an automatic speech recognition (ASR) discriminator is introduced to strengthen it. The discriminator addresses the problem that the model relies too heavily on the textual content of the reference audio and ignores the input text, which yields synthesized speech that is unrelated to the input text or blurred; with it, the system becomes insensitive to the reference audio while sacrificing very little speech quality. | zh_TW |
dc.description.provenance | Made available in DSpace on 2021-06-17T06:17:07Z (GMT). No. of bitstreams: 1 ntu-107-R05942031-1.pdf: 11549040 bytes, checksum: f1d3fd89bcb74344b8dbde4286dceaa7 (MD5) Previous issue date: 2018 | en |
dc.description.tableofcontents | Acknowledgements ... ii
Chinese Abstract ... v
Chapter 1: Introduction ... 1
1.1 Research Motivation ... 1
1.2 Research Direction ... 3
1.3 Related Work ... 4
1.3.1 Concatenative Speech Synthesis ... 4
1.3.2 Statistical Parametric Speech Synthesis ... 5
1.3.3 Neural-Network-Based Speech Synthesis ... 5
1.4 Thesis Organization ... 5
Chapter 2: Background ... 7
2.1 Sequence-to-Sequence Learning ... 7
2.1.1 Overview ... 7
2.1.2 Neural Networks ... 7
2.1.3 Recurrent Neural Networks (RNN) ... 11
2.1.4 Sequence-to-Sequence Models ... 13
2.2 Conventional Text-to-Speech Systems ... 16
2.2.1 Overview ... 16
2.2.2 Training Stage ... 16
2.2.3 Synthesis Stage ... 19
2.3 Neural-Network-Based Text-to-Speech Systems ... 21
2.3.1 Overview ... 21
2.3.2 WaveNet ... 21
2.3.3 Char2Wav ... 25
2.4 Chapter Summary ... 29
Chapter 3: Text Units of Different Granularity as Input to an End-to-End Mandarin Speech Synthesis System ... 30
3.1 Overview ... 30
3.2 Model Architecture ... 32
3.2.1 Encoder Module ... 34
3.2.2 Decoder Module ... 36
3.2.3 Post-processing Module ... 38
3.3 Datasets ... 39
3.3.1 LJ Speech ... 39
3.3.2 WEB ... 40
3.3.3 Lecture Recordings ... 40
3.4 Basic Experimental Setup ... 41
3.4.1 Preprocessing ... 41
3.4.2 Baseline Experiments ... 46
3.5 Results and Discussion ... 49
3.5.1 Comparison of Input Granularities ... 49
3.5.2 With vs. Without Guided Attention ... 55
3.6 Chapter Summary ... 57
Chapter 4: End-to-End Mandarin Text to Taiwanese (Southern Min) Speech Synthesis System ... 58
4.1 Overview ... 58
4.2 Model Architecture ... 59
4.3 Datasets ... 62
4.4 Basic Experimental Setup ... 63
4.4.1 Preprocessing ... 63
4.4.2 Baseline Experiments ... 64
4.5 Results and Discussion ... 64
4.5.1 Comparison of Input Granularities ... 64
4.5.2 With vs. Without Guided Attention ... 65
4.5.3 With vs. Without Scheduled Sampling ... 73
4.6 Chapter Summary ... 74
Chapter 5: Strengthening a Multi-Speaker End-to-End Speech Synthesis System with an Automatic Speech Recognition Discriminator ... 75
5.1 Overview ... 75
5.2 Model Architecture ... 77
5.2.1 Reference Encoder Module ... 79
5.2.2 Automatic Speech Recognition Discriminator ... 79
5.3 Datasets ... 84
5.3.1 VCTK Dataset ... 84
5.3.2 LibriSpeech Dataset ... 84
5.4 Basic Experimental Setup ... 85
5.4.1 Preprocessing ... 85
5.4.2 Baseline Experiments ... 88
5.5 Results and Discussion ... 90
5.5.1 Speaker Classification Evaluation ... 90
5.5.2 ASR Evaluation ... 91
5.5.3 Global Variance Evaluation ... 92
5.5.4 Subjective Evaluation ... 92
5.6 Chapter Summary ... 94
Chapter 6: Conclusion and Future Work ... 95
6.1 Conclusion ... 95
6.2 Future Work ... 96
References ... 97 | |
dc.language.iso | zh-TW | |
dc.title | 基於類神經網路的端對端語音合成系統之表現強化 | zh_TW |
dc.title | Performance Improvement of Neural Network based End-to-end Text-to-Speech System | en |
dc.type | Thesis | |
dc.date.schoolyear | 106-2 | |
dc.description.degree | Master | |
dc.contributor.oralexamcommittee | 李琳山,陳信宏,鄭秋豫,王小川 | |
dc.subject.keyword | 語音合成,端對端,粒度,語音轉換 | zh_TW |
dc.subject.keyword | speech synthesis, end to end, granularity, voice conversion | en |
dc.relation.page | 101 | |
dc.identifier.doi | 10.6342/NTU201803811 | |
dc.rights.note | Paid authorization | |
dc.date.accepted | 2018-08-21 | |
dc.contributor.author-college | 電機資訊學院 | zh_TW |
dc.contributor.author-dept | 電信工程學研究所 | zh_TW |
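The guided attention mechanism named in the abstract follows Tachibana et al. [20]: an auxiliary loss penalizes attention weights that stray from the roughly diagonal, monotonic alignment between text positions and output frames, so the model learns the alignment quickly. Below is a minimal NumPy sketch of that loss; the function names and the width parameter g = 0.2 are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def guided_attention_penalty(num_text, num_frames, g=0.2):
    """Penalty matrix W[n, t]: near 0 on the text-frame diagonal,
    approaching 1 far from it (Tachibana et al., 2017). g is an
    assumed width hyperparameter."""
    n = np.arange(num_text).reshape(-1, 1) / num_text      # normalized text position
    t = np.arange(num_frames).reshape(1, -1) / num_frames  # normalized frame position
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g ** 2))

def guided_attention_loss(attention, g=0.2):
    """Mean of A * W over the attention matrix A; it is small only when
    attention mass stays close to a monotonic diagonal alignment."""
    num_text, num_frames = attention.shape
    return float(np.mean(attention * guided_attention_penalty(num_text, num_frames, g)))
```

In training this term is simply added to the spectrogram reconstruction loss; because it only discourages off-diagonal attention, it speeds up alignment learning without constraining what the aligned positions actually encode.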
Appears in Collections: | 電信工程學研究所 |
Files in This Item:
File | Size | Format |
---|---|---|
ntu-107-1.pdf (currently not authorized for public access) | 11.28 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.