NTU Theses and Dissertations Repository > College of Electrical Engineering and Computer Science > Department of Computer Science and Information Engineering
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/71727
Full metadata record (DC field: value, language)
dc.contributor.advisor: 李琳山 (Lin-shan Lee)
dc.contributor.author: Tao Tuen
dc.contributor.author: 杜濤 zh_TW
dc.date.accessioned: 2021-06-17T06:07:50Z
dc.date.available: 2020-11-12
dc.date.copyright: 2020-11-12
dc.date.issued: 2020
dc.date.submitted: 2020-10-27
dc.identifier.citation: [1] Andrew J Hunt and Alan W Black, “Unit selection in a concatenative speech synthesis system using a large speech database,” in 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings. IEEE, 1996, vol. 1, pp. 373–376.
[2] Heiga Zen, Keiichi Tokuda, and Alan W Black, “Statistical parametric speech synthesis,” Speech Communication, vol. 51, no. 11, pp. 1039–1064, 2009.
[3] Heiga Zen, Andrew Senior, and Mike Schuster, “Statistical parametric speech synthesis using deep neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 7962–7966.
[4] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, “Wavenet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
[5] Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al., “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4779–4783.
[6] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al., “Tacotron: Towards end-to-end speech synthesis,” arXiv preprint arXiv:1703.10135, 2017.
[7] Wei Ping, Kainan Peng, and Jitong Chen, “Clarinet: Parallel wave generation in end-to-end text-to-speech,” arXiv preprint arXiv:1807.07281, 2018.
[8] Yaniv Taigman, Lior Wolf, Adam Polyak, and Eliya Nachmani, “Voiceloop: Voice fitting and synthesis via a phonological loop,” arXiv preprint arXiv:1707.06588, 2017.
[9] Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron Courville, and Yoshua Bengio, “Char2wav: End-to-end speech synthesis,” 2017.
[10] Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu, “Fastspeech: Fast, robust and controllable text to speech,” in Advances in Neural Information Processing Systems, 2019, pp. 3165–3174.
[11] Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller, “Deep voice 3: Scaling text-to-speech with convolutional sequence learning,” arXiv preprint arXiv:1710.07654, 2017.
[12] Jihyun Park, Kexin Zhao, Kainan Peng, and Wei Ping, “Multi-speaker end-to-end speech synthesis,” arXiv preprint arXiv:1907.04462, 2019.
[13] Vincent Wan, Chun-an Chan, Tom Kenter, Jakub Vit, and Rob Clark, “Chive: Varying prosody in speech synthesis with a linguistically driven dynamic hierarchical conditional variational network,” arXiv preprint arXiv:1905.07195, 2019.
[14] Andrew Rosenberg, Bhuvana Ramabhadran, Guangzhi Sun, Heiga Zen, Ron J. Weiss, Yonghui Wu, Yu Zhang, and Yuan Cao, “Generating diverse and natural text-to-speech samples using quantized fine-grained vae and autoregressive prosody prior,” in ICASSP, 2020.
[15] Viacheslav Klimkov, Srikanth Ronanki, Jonas Rohnke, and Thomas Drugman, “Fine-grained robust prosody transfer for single-speaker neural text-to-speech,” arXiv preprint arXiv:1907.02479, 2019.
[16] Younggun Lee and Taesu Kim, “Robust and fine-grained prosody control of end-to-end speech synthesis,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5911–5915.
[17] RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J Weiss, Rob Clark, and Rif A Saurous, “Towards end-to-end prosody transfer for expressive speech synthesis with tacotron,” arXiv preprint arXiv:1803.09047, 2018.
[18] Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Fei Ren, Ye Jia, and Rif A Saurous, “Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis,” arXiv preprint arXiv:1803.09017, 2018.
[19] Wei-Ning Hsu, Yu Zhang, Ron J Weiss, Heiga Zen, Yonghui Wu, Yuxuan Wang, Yuan Cao, Ye Jia, Zhifeng Chen, Jonathan Shen, et al., “Hierarchical generative modeling for controllable speech synthesis,” arXiv preprint arXiv:1810.07217, 2018.
[20] Ye Jia, Yu Zhang, Ron Weiss, Quan Wang, Jonathan Shen, Fei Ren, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu, et al., “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” in Advances in Neural Information Processing Systems, 2018, pp. 4480–4490.
[21] Yu Zhang, Ron J Weiss, Heiga Zen, Yonghui Wu, Zhifeng Chen, RJ Skerry-Ryan, Ye Jia, Andrew Rosenberg, and Bhuvana Ramabhadran, “Learning to speak fluently in a foreign language: Multilingual speech synthesis and cross-language voice cloning,” arXiv preprint arXiv:1907.04448, 2019.
[22] Yu-An Chung, Yuxuan Wang, Wei-Ning Hsu, Yu Zhang, and RJ Skerry-Ryan, “Semi-supervised training for improving data efficiency in end-to-end speech synthesis,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6940–6944.
[23] Yi Ren, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu, “Almost unsupervised text to speech and automatic speech recognition,” arXiv preprint arXiv:1905.06791, 2019.
[24] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.
[25] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[26] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
[27] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams, “Learning internal representations by error propagation,” Tech. Rep., California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
[28] Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[29] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
[30] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 369–376.
[31] Ilya Sutskever, Oriol Vinyals, and Quoc V Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014, pp. 3104–3112.
[32] Ronald J Williams and David Zipser, “A learning algorithm for continually running fully recurrent neural networks,” Neural computation, vol. 1, no. 2, pp. 270–280, 1989.
[33] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[34] Yu-An Chung, Chao-Chung Wu, Chia-Hao Shen, Hung-Yi Lee, and Lin-Shan Lee, “Audio word2vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder,” arXiv preprint arXiv:1603.00982, 2016.
[35] Yu-Hsuan Wang, Hung-yi Lee, and Lin-shan Lee, “Segmental audio word2vec: Representing utterances as sequences of vectors with applications in spoken term detection,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 6269–6273.
[36] Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura, “Listening while speaking: Speech chain by deep learning,” in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017, pp. 301–308.
[37] Aaron van den Oord, Oriol Vinyals, et al., “Neural discrete representation learning,” in Advances in Neural Information Processing Systems, 2017, pp. 6306–6315.
[38] Yoshua Bengio, Nicholas Léonard, and Aaron Courville, “Estimating or propagating gradients through stochastic neurons for conditional computation,” arXiv preprint arXiv:1308.3432, 2013.
[39] Keith Ito, “The LJ Speech dataset,” https://keithito.com/LJ-Speech-Dataset/, 2017.
[40] Tomoki Hayashi, Shinji Watanabe, Tomoki Toda, Kazuya Takeda, Shubham Toshniwal, and Karen Livescu, “Pre-trained text embeddings for enhanced text-to-speech synthesis,” Proc. Interspeech 2019, pp. 4430–4434, 2019.
[41] Christophe Veaux, Junichi Yamagishi, and Kirsten MacDonald, “CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit,” 2017.
[42] Mirjam Wester, Zhizheng Wu, and Junichi Yamagishi, “Analysis of the voice conversion challenge 2016 evaluation results,” 2016.
[43] David Snyder, Guoguo Chen, and Daniel Povey, “Musan: A music, speech, and noise corpus,” arXiv preprint arXiv:1510.08484, 2015.
[44] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[45] Wei Fang, Yu-An Chung, and James Glass, “Towards transfer learning for end-to-end speech synthesis from deep pre-trained language models,” arXiv preprint arXiv:1906.07307, 2019.
[46] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5329–5333.
[47] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno, “Generalized end-to-end loss for speaker verification,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4879–4883.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/71727
dc.description.abstract: Text-to-speech (TTS) is the task of converting an input sentence into speech. With powerful deep neural networks, today's TTS technology is already nearly indistinguishable from real human voices on many evaluation scores. Unfortunately, training a high-quality TTS model requires a large amount of paired text and audio data (labeled data), and collecting labeled data is both time-consuming and expensive. On the other hand, semi-supervised learning methods have recently achieved good results on many natural language processing and speech processing tasks: a large amount of unlabeled data can be used to reduce the amount of labeled data needed for training and to improve model performance. For these reasons, this thesis investigates the effect of semi-supervised learning on the TTS task and experimentally analyzes the properties of the proposed model as well as how the characteristics of the labeled and unlabeled data affect its performance.
This thesis proposes the Sequential Quantized Representation Auto-Encoder. The model consists of an encoder, a speech quantizer, and a decoder; the quantizer contains a codebook, and the codebook stores a speech representation vector corresponding to each phoneme. The auto-encoder converts the input audio into a phoneme sequence through the encoder and the quantizer, and restores the phoneme sequence back to audio through the decoder, completing an audio reconstruction. Through this reconstruction, the model can learn the pronunciation characteristics of human language from unlabeled audio. When converting audio into a phoneme sequence, the quantizer retrieves the speech representation vector corresponding to each phoneme from the codebook; to ensure that the speech representation vectors in the codebook correspond one-to-one with the phonemes, a small amount of labeled data is used for maximum likelihood training. With these speech representation vectors, the model can therefore perform the TTS task effectively.
Experiments show that the proposed model can synthesize intelligible speech with only 20 minutes of labeled data in the single-speaker TTS task, and that with only 60 minutes of labeled data in the multi-speaker TTS task it matches a TTS model obtained by supervised learning on 25 hours of labeled data.
zh_TW
dc.description.abstract: Text-to-speech (TTS) is the artificial production of human speech from input text. Thanks to powerful deep learning models, many TTS systems now produce speech that is nearly indistinguishable from human speech in terms of Mean Opinion Score (MOS) and the Multi-Stimulus Test with Hidden Reference and Anchor (MUSHRA). However, training a high-performing TTS system requires a large amount of labeled data, and this laborious data collection process is expensive and time-consuming, which prevents many institutes from building good TTS systems. On the other hand, semi-supervised learning methods have recently achieved good results in natural language processing and speech processing: a large amount of unlabeled data can be utilized to reduce the amount of labeled data required to train a model and to improve its performance. For these reasons, this thesis investigates the effect of semi-supervised learning on the TTS task, and we design experiments to analyze the properties of the proposed model and the effects of different characteristics of the labeled and unlabeled data on the model.
In this thesis, we propose the Sequential Quantized Representation Auto-Encoder (SeqQR-AE). The model consists of an encoder, a phoneme quantizer, and a decoder. The phoneme quantizer contains a phoneme codebook that stores a set of vectors, where each vector corresponds to a phoneme. SeqQR-AE converts the input audio into a phoneme sequence through the encoder and the codebook, and restores the phoneme sequence to audio through the decoder. Through this audio-to-audio reconstruction, the model can learn the characteristics of human speech from unlabeled audio. In the process of converting audio into a phoneme sequence, the phoneme quantizer retrieves, for each input audio frame, the phoneme representation from the phoneme codebook based on a pre-defined codeword-phoneme mapping, and the maximum likelihood method is adopted to ensure the correctness of this mapping. With these phoneme representations, the model can perform TTS effectively.
The experiments show that the proposed SeqQR-AE can generate intelligible speech with only 20 minutes of labeled data in the single-speaker setting. In the multi-speaker setting, the model trained with 60 minutes of labeled data generates outputs comparable to those of a supervised model trained with 25 hours of labeled data.
en
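The abstract above describes the core SeqQR-AE mechanism: an encoder maps audio frames to continuous vectors, a quantizer snaps each frame to its nearest entry in a phoneme codebook, and a decoder reconstructs the audio from the quantized sequence. A minimal sketch of such a frame-level codebook lookup with a straight-through gradient is given below, written in PyTorch purely for illustration; the class name, codebook size, L2 nearest-neighbour rule, and all other details are assumptions of this sketch, not the thesis implementation.

import torch
import torch.nn as nn


class PhonemeQuantizer(nn.Module):
    """Illustrative frame-level codebook lookup in the spirit of the
    SeqQR-AE description above (a sketch, not the thesis code)."""

    def __init__(self, num_phonemes: int = 70, dim: int = 256):
        super().__init__()
        # One codeword per phoneme; the thesis ties codewords to phonemes
        # using a small amount of labeled data (that step is omitted here).
        self.codebook = nn.Embedding(num_phonemes, dim)

    def forward(self, encoder_out: torch.Tensor):
        # encoder_out: (batch, frames, dim) continuous vectors from the encoder.
        b, t, d = encoder_out.shape
        flat = encoder_out.reshape(b * t, d)
        # Assumed L2 nearest-neighbour lookup against the codebook entries.
        dists = torch.cdist(flat, self.codebook.weight)      # (b*t, num_phonemes)
        phoneme_ids = dists.argmin(dim=-1).reshape(b, t)     # frame-level phoneme ids
        quantized = self.codebook(phoneme_ids)               # (b, t, dim)
        # Straight-through estimator so gradients flow back to the encoder.
        quantized = encoder_out + (quantized - encoder_out).detach()
        return quantized, phoneme_ids


if __name__ == "__main__":
    quantizer = PhonemeQuantizer()
    frames = torch.randn(2, 100, 256)      # dummy encoder outputs
    quantized, ids = quantizer(frames)
    print(quantized.shape, ids.shape)      # a decoder would reconstruct audio from `quantized`

In the full model, the decoder would consume the quantized vectors to reconstruct the audio, and the small labeled set mentioned in the abstract would be used to keep each codeword aligned with a specific phoneme.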
dc.description.provenance: Made available in DSpace on 2021-06-17T06:07:50Z (GMT). No. of bitstreams: 1
U0001-2110202021035400.pdf: 2866109 bytes, checksum: d247e17eec52a30bb896985bdfb5f8ba (MD5)
Previous issue date: 2020
en
dc.description.tableofcontents: Contents
Thesis committee approval certificate - i
Chinese abstract - ii
English abstract - iv
Chapter 1. Introduction - 1
1.1 Research objectives - 1
1.2 Research directions - 3
1.3 Chapter organization - 4
Chapter 2. Background - 5
2.1 Deep Neural Networks (DNN) - 5
2.1.1 Model and principles - 5
2.1.2 Convolutional Neural Networks (CNN) - 9
2.1.3 Recurrent Neural Networks (RNN) - 11
2.2 Connectionist Temporal Classifier (CTC) - 13
2.2.1 Overview - 13
2.2.2 Method - 14
2.3 Sequence-to-sequence Learning - 15
2.3.1 Encoder-Decoder Architecture - 15
2.3.2 Attention Mechanism - 17
2.4 Neural Network Based Text-to-speech Models - 19
2.4.1 Overview - 19
2.4.2 The Tacotron model - 20
2.4.3 The Tacotron-2 model - 22
2.5 Chapter summary - 23
Chapter 3. Semi-supervised text-to-speech with the cascading recognition-synthesis framework - 24
3.1 Overview - 24
3.2 Related work - 26
3.3 Cascading Recognition-Synthesis Framework - 28
3.3.1 Speech auto-encoder - 28
3.3.2 Acoustic clustering - 30
3.3.3 Temporal segmentation of sequential representations - 32
3.3.4 Speech representation mapping - 32
3.3.5 Sequential quantized representation auto-encoder - 35
3.4 Experiments - 36
3.4.1 Experimental setup - 36
3.4.2 Experimental results - 38
3.5 Chapter summary - 46
Chapter 4. Semi-supervised multi-speaker text-to-speech with the cascading recognition-synthesis framework - 47
4.1 Overview - 47
4.2 Related work - 48
4.3 Multi-speaker sequential quantized representation auto-encoder - 49
4.3.1 Speech encoder - 49
4.3.2 Decoder - 50
4.3.3 Model training - 52
4.4 Experiments - 52
4.4.1 Experimental setup - 52
4.4.2 Experimental results - 57
4.5 Chapter summary - 63
Chapter 5. Conclusion and future work - 65
5.1 Contributions and discussion - 65
5.2 Future work - 66
5.2.1 Unlabeled text data - 66
5.2.2 One-shot multi-speaker speech synthesis - 67
5.2.3 Towards unsupervised speech synthesis - 67
References - 68
dc.language.iso: zh-TW
dc.subject: 語音合成 (speech synthesis) zh_TW
dc.subject: 文句翻語音合成 (text-to-speech synthesis) zh_TW
dc.subject: 語音處理 (speech processing) zh_TW
dc.subject: 半監督式學習 (semi-supervised learning) zh_TW
dc.subject: 深層學習 (deep learning) zh_TW
dc.subject: speech synthesis en
dc.subject: text-to-speech en
dc.subject: speech processing en
dc.subject: deep learning en
dc.subject: semi-supervised learning en
dc.title: 利用序列量化表徵自編碼器實現半監督式學習之文句翻語音合成 zh_TW
dc.title: Semi-supervised Text-to-speech Synthesis Using Sequential Quantized Representation Auto-Encoder en
dc.type: Thesis
dc.date.schoolyear: 109-1
dc.description.degree: 碩士 (Master)
dc.contributor.oralexamcommittee: 鄭秋豫 (Chiu-yu Tseng), 王小川 (Hsiao-Chuan Wang), 李宏毅 (Hung-yi Lee), 簡仁宗 (Jen-Tzung Chien)
dc.subject.keyword: 語音合成, 文句翻語音合成, 語音處理, 半監督式學習, 深層學習 zh_TW
dc.subject.keyword: speech synthesis, text-to-speech, speech processing, semi-supervised learning, deep learning en
dc.relation.page: 75
dc.identifier.doi: 10.6342/NTU202004303
dc.rights.note: 有償授權 (paid authorization)
dc.date.accepted: 2020-10-28
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) zh_TW
dc.contributor.author-dept: 資訊工程學研究所 (Graduate Institute of Computer Science and Information Engineering) zh_TW
Appears in collections: Department of Computer Science and Information Engineering

Files in this item:
File | Size | Format
U0001-2110202021035400.pdf (restricted access, not publicly available) | 2.8 MB | Adobe PDF


All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
