Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91788
Title: | Improving the Applicability of Automatic Speech Recognition (ASR) Systems: Unsupervised ASR and Contextual ASR |
Author: | Da-Rong Liu |
Advisor: | Hung-yi Lee |
Keywords: | speech recognition, unsupervised learning, contextual ASR, generative adversarial network, language model |
Publication Year: | 2022 |
Degree: | Doctoral |
Abstract: | Automatic speech recognition (ASR) has achieved remarkable performance thanks to the development of deep learning, but it is not always effective. First, training an ASR system relies on large amounts of paired speech and text data, which are difficult to obtain for the more than 95% of the world's languages that are low-resourced. Second, an ASR system cannot easily adapt to different contextual scenarios, because word preferences vary across scenarios and scenario-specific rare words may appear. In this thesis, we aim to improve the applicability of ASR systems across languages and contextual scenarios.

Since it is easier to collect a large amount of unlabeled data than a large amount of paired data, is it possible to train an unsupervised ASR system from unpaired speech and text? If this technology succeeds, the cost of training ASR systems will be greatly reduced, and low-resource languages can also enjoy high-quality speech recognition. This thesis presents the world's first successful attempt at unsupervised speech recognition, realized as a two-stage iterative framework. In the first stage, text is transformed into phoneme sequences by a lexicon, and a generative adversarial network (GAN) is employed to find the mapping from unannotated speech to phoneme sequences. In the second stage, a hidden Markov model (HMM) is trained on the GAN's output, further improving performance and providing better phoneme segmentation for the next iteration of GAN training.

Different GAN architectures are explored. Inspired by the widely used technique of identifying acoustic tokens from speech, we first propose a GAN architecture that performs unsupervised phoneme recognition by generating discrete acoustic tokens from speech and learning their mapping to phonemes through the GAN. However, we find that the performance of this approach is limited by the quality of the generated acoustic tokens. To address this issue, we further propose new GAN architectures that do not rely on these tokens. Our iterative framework achieves a phoneme error rate of 36.71% on the benchmark TIMIT corpus, the lowest unsupervised phoneme error rate reported on TIMIT before 2021.

Next, to improve the applicability of ASR systems in different contextual scenarios, this thesis studies how textual contextual information can enhance ASR performance. Past research has mainly focused on digital-assistant tasks that use contact lists and similar data as contextual information. This thesis instead aims to improve the recognition of video content by using the video descriptions that users upload to social media; unlike previous work, the ASR system must recognize more diverse content and must use long text passages as contextual information. Our proposed model consists of an attention model for summarizing information and a pointer network for selecting the correct rare words from the textual context. The proposed model achieves a 5% relative word error rate reduction on a commercial system trained on tens of thousands of hours of paired speech and text data. |
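The adversarial idea in the framework's first stage — a discriminator learns to tell real, text-derived phoneme sequences apart from the generator's transcriptions of speech — can be illustrated with a deliberately tiny sketch. Everything below (the linear discriminator, the 4-dimensional sequence features, the learning rate) is a hypothetical simplification for illustration, not the model used in the thesis:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discriminator_step(w, real, fake, lr=0.1):
    """One gradient step of a linear discriminator trained with the
    standard GAN objective: score `real` high and `fake` low.
    Loss = -log D(real) - log(1 - D(fake))."""
    d_real, d_fake = sigmoid(real @ w), sigmoid(fake @ w)
    grad = -(1.0 - d_real) * real + d_fake * fake  # d(loss)/dw
    return w - lr * grad

# Toy "features" standing in for a real phoneme sequence from text
# and a generated sequence from speech (purely illustrative).
real = np.array([1.0, 0.0, 1.0, 0.0])
fake = np.array([0.0, 1.0, 0.0, 1.0])
w = np.zeros(4)
for _ in range(200):
    w = discriminator_step(w, real, fake)
```

After training, the discriminator confidently separates the two, which is exactly the signal the generator must then defeat by producing more text-like phoneme sequences.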
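The pointer mechanism in the contextual model — interpolating the decoder's vocabulary distribution with an attention distribution over context tokens, so that rare words present in the context gain probability mass — can be sketched as follows. The function name, the scalar gate, and the toy dimensions are illustrative assumptions, not the thesis's architecture:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def contextual_distribution(vocab_logits, ctx_scores, ctx_to_vocab, gate):
    """Mix the decoder's vocabulary distribution with a pointer (copy)
    distribution over context tokens; `gate` in [0, 1] weights the two."""
    p_vocab = softmax(vocab_logits)         # distribution over the vocabulary
    p_ctx = softmax(ctx_scores)             # attention over context tokens
    p_copy = np.zeros_like(p_vocab)
    for c, v in enumerate(ctx_to_vocab):    # scatter pointer mass to vocab ids
        p_copy[v] += p_ctx[c]
    return gate * p_vocab + (1.0 - gate) * p_copy

# A rare word (vocabulary id 3) appears in the context description with a
# high attention score, so pointing boosts its probability.
p = contextual_distribution(
    vocab_logits=np.zeros(5),               # decoder is uncertain: uniform
    ctx_scores=np.array([2.0, 0.0]),        # attention scores of context tokens
    ctx_to_vocab=[3, 1],                    # context token -> vocabulary id
    gate=0.5,
)
```

The result is still a valid probability distribution, and the context-supported rare word now outscores every other vocabulary entry even though the decoder alone was uncertain.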
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91788 |
DOI: | 10.6342/NTU202304180 |
Full-text Authorization: | Authorized (open access worldwide) |
Appears in Collections: | Graduate Institute of Communication Engineering |
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-112-1.pdf | 5.31 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.