Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/71727
Title: | Semi-supervised Text-to-speech Synthesis Using Sequential Quantized Representation Auto-Encoder (利用序列量化表徵自編碼器實現半監督式學習之文句翻語音合成) |
Author: | Tao Tu (杜濤) |
Advisor: | Lin-shan Lee (李琳山) |
Keywords: | speech synthesis, text-to-speech, speech processing, semi-supervised learning, deep learning |
Publication Year: | 2020 |
Degree: | Master |
Abstract: | Text-to-speech (TTS) is the artificial production of human speech from input text. Thanks to powerful deep neural networks, modern TTS systems produce speech that is nearly indistinguishable from human speech in terms of the Mean Opinion Score (MOS) and the MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA) test. However, training a high-quality TTS model requires a large amount of paired text and audio data (labeled data), and collecting such data is expensive and time-consuming; this requirement prevents many institutions from building strong TTS systems. On the other hand, semi-supervised learning methods have recently achieved good results on many natural language processing and speech processing tasks: a large amount of unlabeled data can be used to reduce the amount of labeled data required for training and to improve model performance. Motivated by these observations, this thesis investigates the effect of semi-supervised learning on the TTS task, with experiments designed to analyze the properties of the proposed model and how the characteristics of the labeled and unlabeled data affect its performance. We propose the Sequential Quantized Representation Auto-Encoder (SeqQR-AE).
The model consists of an encoder, a phoneme quantizer, and a decoder. The quantizer contains a phoneme codebook that stores one representation vector per phoneme. SeqQR-AE converts input audio into a phoneme sequence through the encoder and the quantizer, then restores the phoneme sequence to audio through the decoder, completing an audio-to-audio reconstruction. Through this reconstruction, the model learns the pronunciation characteristics of human speech from unlabeled audio. When converting audio into a phoneme sequence, the quantizer retrieves, for each input frame, the corresponding phoneme representation from the codebook; to ensure a one-to-one correspondence between codebook vectors and phonemes, a small amount of labeled data is used for maximum-likelihood training. With these phoneme representations, the model can perform TTS effectively. Experiments show that the proposed SeqQR-AE can synthesize intelligible speech with only 20 minutes of labeled data in the single-speaker setting; in the multi-speaker setting, the model trained with 60 minutes of labeled data produces outputs comparable to those of a supervised model trained with 25 hours of labeled data. |
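The quantization step described above can be sketched in a few lines: each encoder output frame is snapped to its nearest codebook vector, and the resulting codeword indices are read as a phoneme sequence. This is an illustrative sketch only; the shapes, names, and nearest-neighbor lookup are assumptions, not details taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical codebook: one representation vector per phoneme.
num_phonemes, dim = 40, 8
codebook = rng.normal(size=(num_phonemes, dim))

def quantize(encoder_frames: np.ndarray):
    """Map each encoder frame of shape (T, dim) to its nearest codebook vector.

    Returns the quantized frames (which a decoder would reconstruct audio from)
    and the codeword indices (interpreted as a phoneme-like sequence).
    """
    # Squared Euclidean distance from every frame to every codeword.
    dists = ((encoder_frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    indices = dists.argmin(axis=1)           # (T,) one codeword per frame
    return codebook[indices], indices

frames = rng.normal(size=(5, dim))           # 5 fake encoder output frames
quantized, phoneme_ids = quantize(frames)
```

In the thesis, a small amount of labeled data keeps this codeword-to-phoneme mapping aligned via maximum-likelihood training, so the same codebook can be indexed by text at synthesis time.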
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/71727 |
DOI: | 10.6342/NTU202004303 |
Full-text license: | Paid license |
Appears in Collections: | Department of Computer Science and Information Engineering |
Files in This Item:
File | Size | Format |
---|---|---|
U0001-2110202021035400.pdf (currently not publicly accessible) | 2.8 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.