Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/71683

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 李琳山(Lin-shan Lee) | |
| dc.contributor.author | Yuan-Jui Chen | en |
| dc.contributor.author | 陳元瑞 | zh_TW |
| dc.date.accessioned | 2021-06-17T06:06:31Z | - |
| dc.date.available | 2020-12-25 | |
| dc.date.copyright | 2020-12-25 | |
| dc.date.issued | 2020 | |
| dc.date.submitted | 2020-11-17 | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/71683 | - |
| dc.description.abstract | Technology has advanced rapidly in recent years, and advanced techniques have entered human life along with mobile phones and wearable devices. One example is the voice assistant, which completes tasks through simple dialogue with the help of text-to-speech and speech recognition. With the development of neural networks, both technologies have achieved great success, repeatedly exceeding previous systems in synthesized sound quality, fluency, and recognition accuracy. These models, however, must be trained on large amounts of annotated data, whose collection demands substantial money and manpower, a cost many low-resource languages cannot afford. The main theme of this thesis is thus how to apply transfer learning more effectively so that resource-scarce languages can build speech synthesis and speech recognition models, i.e., how to make models trained on the plentiful annotated data of high-resource languages usable for low-resource ones. In cross-lingual transfer learning, a model often encounters mismatch at its input or output: the sound units at the input of a speech synthesis model and at the output of a speech recognition model differ from language to language, so transfer learning performs poorly. If the correspondence between the sound units of different languages were known, it could be used to resolve this mismatch and strengthen transfer. This thesis therefore proposes a method for automatic cross-lingual sound unit mapping and applies it to transfer learning for the recently popular neural speech synthesis model Tacotron and for a connectionist temporal classification speech recognition model, and verifies experimentally whether the discovered mappings help transfer learning. Multiple evaluation criteria are used, including subjective human ratings of the naturalness of synthesized speech and the objective character error rate of recognition results. All results show that the proposed automatic cross-lingual sound unit mapping improves transfer learning on both tasks, yielding better-sounding speech and more accurate recognition when only a small amount of annotated data is available. | zh_TW |
| dc.description.abstract | In recent years, advanced technologies have entered everyday life through mobile phones and wearable devices. One example is the voice assistant, which helps users complete tasks through simple conversations using text-to-speech (TTS) and speech recognition. Both technologies have achieved great success with the recent development of neural networks, repeatedly surpassing earlier systems in synthesis quality, fluency, and recognition accuracy. However, training these models relies on large amounts of annotated data, which costs considerable money and manpower and is unaffordable for many low-resource languages. This thesis therefore explores how to use transfer learning more effectively to build speech synthesis and speech recognition models for under-resourced languages; that is, to reuse models trained on the abundant annotated data of high-resource languages for low-resource languages. In cross-lingual transfer learning, models often suffer from input or output mismatch: the sound units at the input of a speech synthesis model and at the output of a speech recognition model differ across languages, which degrades transfer. If the mapping between sound units across languages is known, it can be used to resolve the mismatch and improve transfer learning. This thesis therefore proposes an automatic cross-lingual sound unit mapping method and applies it to transfer learning for the recently popular neural speech synthesis model Tacotron and for a connectionist temporal classification (CTC) speech recognition model, verifying experimentally whether the discovered mappings help. Evaluation uses several measures, including subjective human ratings of the naturalness of synthesized speech and the objective character error rate (CER) of recognition results. The results show that the proposed mapping improves transfer learning on both tasks, producing more natural speech and more accurate recognition with only a small amount of labeled data. (A minimal illustrative sketch of the mapping-based warm start follows the metadata table below.) | en |
| dc.description.provenance | Made available in DSpace on 2021-06-17T06:06:31Z (GMT). No. of bitstreams: 1 U0001-2710202015034700.pdf: 4881903 bytes, checksum: d905e58bcfdfe759639b162da037a490 (MD5) Previous issue date: 2020 | en |
| dc.description.tableofcontents | Thesis Committee Certification i; Chinese Abstract ii; English Abstract iv; Chapter 1 Introduction 1; 1.1 Research Motivation 1; 1.2 Research Direction 3; 1.3 Thesis Organization 4; Chapter 2 Background 5; 2.1 Deep Neural Networks 5; 2.1.1 Overview 5; 2.1.2 Convolutional Neural Networks 8; 2.1.3 Recurrent Neural Networks 9; 2.2 Sequence-to-Sequence Learning 11; 2.2.1 Overview 11; 2.2.2 Encoder-Decoder Architecture 12; 2.2.3 Attention Mechanism 13; 2.3 Connectionist Temporal Classification Model 15; 2.3.1 Overview 15; 2.3.2 Decoding Algorithm 16; 2.4 Transfer Learning 17; 2.4.1 Overview 17; 2.4.2 Algorithms 18; 2.5 Chapter Summary 20; Chapter 3 Cross-lingual Sound Unit Mapping 21; 3.1 Overview 21; 3.2 Speech Conversion Network: Architecture and Training 22; 3.3 Speech Conversion Network: Inference and Sound Unit Mapping 25; 3.4 Experimental Setup and Analysis of Results 25; 3.5 Chapter Summary 35; Chapter 4 End-to-End Speech Synthesis for Low-Resource Languages via Cross-lingual Transfer Learning 36; 4.1 Overview 36; 4.2 The Speech Synthesis Model: Tacotron 37; 4.2.1 Overview 37; 4.2.2 Encoder Module 37; 4.2.3 Decoder Module 39; 4.2.4 Post-processing Module 40; 4.3 Cross-lingual Transfer Learning for the Speech Synthesis Model 40; 4.3.1 Training from Scratch 43; 4.3.2 Expert-Knowledge Approach 43; 4.3.3 Learned-Mapping Approach 43; 4.4 Transfer Learning Procedure and Settings 44; 4.4.1 Dataset Setup 44; 4.4.2 Model Configuration and Data Preprocessing 45; 4.4.3 Model Training 46; 4.5 Experimental Results and Analysis 46; 4.5.1 Objective Evaluation 47; 4.5.2 Subjective Evaluation 48; 4.6 Chapter Summary 50; Chapter 5 End-to-End Speech Recognition for Low-Resource Languages via Cross-lingual Transfer Learning 51; 5.1 Overview 51; 5.2 Cross-lingual Transfer Learning for the Speech Recognition Model 53; 5.2.1 Training from Scratch 54; 5.2.2 Expert-Knowledge Approach 55; 5.2.3 Learned-Mapping Approach 55; 5.3 Transfer Learning Procedure and Settings 56; 5.3.1 Dataset Setup 56; 5.3.2 Model Configuration and Data Preprocessing 57; 5.3.3 Model Training 57; 5.4 Experimental Results and Analysis 58; 5.4.1 Phoneme-to-Phoneme Setting 58; 5.4.2 Phoneme-to-Character Setting 61; 5.4.3 Character-to-Character Setting 64; 5.5 Chapter Summary 67; Chapter 6 Conclusion and Future Work 68; 6.1 Contributions and Discussion 68; 6.1.1 Automatic Cross-lingual Sound Unit Mapping 68; 6.1.2 Cross-lingual Transfer Learning for Speech Synthesis 68; 6.1.3 Cross-lingual Transfer Learning for Speech Recognition 69; 6.2 Future Work 69; 6.2.1 Improving Synthesized Speech Quality 69; 6.2.2 Neural Vocoders for Low-Resource Languages 69; 6.2.3 Transfer Learning for Speech Recognition on More Languages 70; References 71 | |
| dc.language.iso | zh-TW | |
| dc.subject | Speech synthesis | zh_TW |
| dc.subject | Speech recognition | zh_TW |
| dc.subject | Speech synthesis | en |
| dc.subject | Speech recognition | en |
| dc.title | Achieving Low-Resource End-to-End Speech Synthesis and Recognition via Transfer Learning with Cross-lingual Sound Unit Mapping | zh_TW |
| dc.title | Low-resourced End-to-end Speech Synthesis and Recognition by Transfer Learning with Cross-lingual Sound Unit Mapping | en |
| dc.type | Thesis | |
| dc.date.schoolyear | 109-1 | |
| dc.description.degree | Master |
| dc.contributor.oralexamcommittee | 鄭秋豫(Chiu-yu TSENG),李宏毅(Hung-Yi Lee),王小川(Hsiao-Chuan Wang),簡仁宗(Jen-Tzung Chien),陳信宏(Sin-Horng Chen) | |
| dc.subject.keyword | Speech synthesis, Speech recognition | zh_TW |
| dc.subject.keyword | Speech synthesis,Speech recognition, | en |
| dc.relation.page | 77 | |
| dc.identifier.doi | 10.6342/NTU202004310 | |
| dc.rights.note | Paid authorization (restricted access) |
| dc.date.accepted | 2020-11-17 | |
| dc.contributor.author-college | College of Electrical Engineering and Computer Science | zh_TW |
| dc.contributor.author-dept | Graduate Institute of Computer Science and Information Engineering | zh_TW |
| Appears in Collections: | Department of Computer Science and Information Engineering | |
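The abstract's central mechanism, mapping each low-resource sound unit to a high-resource unit and warm-starting the target model from the mapped source parameters, can be made concrete with a short sketch. The code below is a minimal illustration under assumed inputs, not the thesis's actual method (the thesis derives its mapping from a trained speech conversion network, whose details this record does not include): it assumes each sound unit comes with some vector representation (e.g., an averaged acoustic posterior or a learned embedding), maps target units to the nearest source unit by cosine similarity, and copies the mapped rows of a pretrained source embedding table into the target table. All names (`map_units`, `init_target_table`, and their arguments) are hypothetical.

```python
import numpy as np

def map_units(src_vecs: dict, tgt_vecs: dict) -> dict:
    """Map each target-language sound unit to its nearest source-language
    unit by cosine similarity of the (assumed) per-unit vectors."""
    src_names = list(src_vecs)
    S = np.stack([src_vecs[n] for n in src_names])        # (n_src, dim)
    S = S / np.linalg.norm(S, axis=1, keepdims=True)      # unit-normalize rows
    mapping = {}
    for unit, v in tgt_vecs.items():
        v = v / np.linalg.norm(v)
        mapping[unit] = src_names[int(np.argmax(S @ v))]  # best cosine match
    return mapping

def init_target_table(src_table: np.ndarray, src_index: dict,
                      tgt_units: list, mapping: dict) -> np.ndarray:
    """Warm-start the target model's unit-embedding table: mapped units copy
    their source rows, unmapped units stay randomly initialized."""
    rng = np.random.default_rng(0)
    tgt_table = rng.normal(0.0, 0.01, (len(tgt_units), src_table.shape[1]))
    for i, unit in enumerate(tgt_units):
        if unit in mapping:
            tgt_table[i] = src_table[src_index[mapping[unit]]]
    return tgt_table
```

Under these assumptions, `init_target_table` would replace the random embedding initialization of a Tacotron encoder input layer or a CTC output layer before fine-tuning on the small low-resource dataset, while the remaining transferred parameters are loaded unchanged, mirroring the warm start the abstract describes.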
Files in This Item:
| File | Size | Format |
|---|---|---|
| U0001-2710202015034700.pdf (Restricted Access) | 4.77 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.