Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93471
| Title: | Spontaneous Chinese Speech Synthesis by Unsupervised Learning |
| Author: | Tse-Yang Lin (林則仰) |
| Advisor: | Lin-Shan Lee (李琳山) |
| Keywords: | Spontaneous Speech, Self-supervised Model, Unsupervised Learning |
| Publication Year: | 2024 |
| Degree: | Master's |
| Abstract: | Speech synthesis converts text input into speech signals and is widely used in everyday applications such as voice assistants and machine translation systems. Thanks to the powerful deep neural networks available today, this technology can generate speech whose quality is nearly indistinguishable from real human voices. Despite this high quality, however, machine-generated speech still differs noticeably from real human speech in perceived naturalness. The discrepancy arises mainly because current speech synthesis models are typically trained on large amounts of "read speech" recorded under controlled conditions, so their output lacks the naturalness of the spontaneous speech people produce in everyday conversation, which is rich in personal idiosyncrasies and emotion. This thesis investigates spontaneous speech synthesis in depth, addressing two challenges: the scarcity of labeled data and the preservation of generated speech quality. We propose approaches that improve model robustness while reducing training difficulty and cost, achieving fully unsupervised training of spontaneous speech synthesis models. First, we present an automatic classification architecture for spontaneous speech features, which removes the heavy human labor previously required to annotate their categories. Built around a speech self-supervised model, this architecture converts each speech segment carrying a spontaneous feature into a vector, then applies a clustering algorithm to these vectors to obtain pseudo labels for spontaneous speech. Using these pseudo labels, we propose two methods that learn to incorporate spontaneous features into the generated speech: a spontaneous feature predictor and a style transfer model; both successfully blend spontaneous features into the synthesized output. Moreover, speech synthesis models trained on spontaneous speech often suffer significant quality degradation, since collected spontaneous recordings are generally of lower quality than read speech recorded under controlled conditions. We address this issue as well, finding that adding a pre-training phase, or an additional consistency loss during training, effectively overcomes the quality degradation and improves model robustness. |
| URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93471 |
| DOI: | 10.6342/NTU202402056 |
| Full-text Access: | Access granted (restricted to on-campus use) |
| Appears in Collections: | Department of Computer Science and Information Engineering |
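The abstract describes clustering self-supervised speech embeddings to obtain pseudo labels, without naming a specific clustering algorithm. As an illustration only, the sketch below uses a minimal pure-Python k-means over hypothetical toy vectors; in the thesis, each vector would instead be the self-supervised model's embedding of a speech segment carrying a spontaneous feature.

```python
import random

def kmeans(vectors, k, iters=50, seed=0):
    """Minimal k-means: returns a cluster id (pseudo label) per vector."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each vector goes to its nearest centroid
        # (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((v[d] - centroids[c][d]) ** 2
                                  for d in range(len(v))))
            for v in vectors
        ]
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members)
                                for dim in zip(*members)]
    return labels

# Toy "embeddings" standing in for self-supervised model outputs
# (hypothetical values, two well-separated groups).
embeddings = [[0.1, 0.0], [0.2, 0.1], [5.0, 5.1], [5.2, 4.9]]
pseudo_labels = kmeans(embeddings, k=2)
```

The resulting `pseudo_labels` would then serve as training targets, replacing manual annotation of spontaneous-feature categories.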
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-112-2.pdf (restricted to NTU campus IPs; off-campus users should connect via the VPN service) | 2.03 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated in their license terms.
