Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/87902
Title: | 改善殘響環境中的自動語音辨識 Improving ASR in Reverberant Environments |
Author: | 廖彥綸 Yen-Lun Liao |
Advisor: | 張智星 Jyh-Shing Jang |
Keywords: | automatic speech recognition, dereverberation, GAN, data augmentation, dynamic programming |
Publication Year: | 2023 |
Degree: | Master's |
Abstract: | This study integrates dereverberation with a speech recognition system to improve recognition of reverberant speech signals. The commonly used architecture simply feeds the output of a dereverberation model into the downstream speech recognition model. Although this approach seems simple and straightforward, the dereverberation model's output does not match the characteristics of the training data of the ASR acoustic model, which can degrade the final recognition performance.
This thesis proposes a new architecture that improves the baseline speech recognition system from four aspects: 1. a clean/reverberant signal classifier; 2. a dereverberation model; 3. avoidance of train/test data mismatch; 4. model fusion. When a signal enters the architecture, it first passes through the clean/reverberant classifier, trained on a variety of data, so that the signal is routed to a suitable acoustic model. For this classifier, the study compares LCNN training architectures with MAX, AVG, SAP, and ASP pooling layers. Based on the classifier's decision, signals classified as reverberant pass through the dereverberation model and are then recognized by an acoustic model trained on data with the same characteristics, avoiding the mismatch problem; clean signals are likewise recognized by their matched acoustic model. Finally, for model fusion, this study proposes sentence-level fusion (SLF) and word-level fusion (WLF) to pursue a lower character error rate (CER). Experimental results show that, on a self-generated reverberant aishell1 test set, dereverberating with MetricGAN reduced the CER of a tdnn acoustic model from 15.26% to 13.22%; mixup data augmentation on a triphone acoustic model reduced the CER from 17.83% to 17.34% compared with the original training; and on a self-made mixed test set of clean and reverberant aishell1 speech, sentence-level or word-level fusion achieved a CER of 7.23%, an error reduction of up to 20.72% relative to using only the clean-signal acoustic model (CER = 9.12%) or only the reverberant-signal acoustic model (CER = 7.85%).
The emergence of reverberation usually corrupts the quality of indoor speech signals, degrading the performance of automatic speech recognition (ASR). To minimize the influence of reverberation, acoustic dereverberation models are used to pre-process the original signals before submitting them to ASR. This structure yields an apparent improvement; however, the dereverberation model's output is inconsistent with the ASR training dataset, which in turn degrades ASR performance. This thesis refines the conventional structure from four aspects: signal classification, reverberation removal, data-mismatch mitigation, and result fusion. As soon as an audio stream enters the proposed system, a reverberation classifier determines whether the signal is clean or reverberant. Depending on the decision, the system either passes the signal through a dereverberation model before ASR or sends it to ASR directly. This routing also selects the acoustic model (AM) trained on audio with the corresponding acoustic characteristics.
Furthermore, this thesis proposes sentence-level fusion (SLF) and word-level fusion (WLF) to fuse the two recognition results. By dereverberating the signals with MetricGAN, the character error rate (CER) decreased from 15.26% to 13.22% on the rev-aishell1 test set with a tdnn acoustic model. With mixup augmentation, the CER decreased from 17.83% to 17.34% with a triphone acoustic model. When SLF and WLF were applied, a CER of 7.23% was reached on the mixed reverberant and clean aishell1 test set, a 20.72% improvement in CER compared to the best single model. |
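The character error rate reported throughout the abstract is the standard Levenshtein edit distance between reference and hypothesis character strings, normalized by reference length, and is typically computed with dynamic programming (one of the record's keywords). The thesis text itself is not reproduced here, so the following is a minimal generic sketch, not the author's implementation:

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance / len(ref), via dynamic programming."""
    m, n = len(ref), len(hyp)
    # prev[j] = edit distance between ref[:i-1] and hyp[:j] (previous DP row)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / m if m else 0.0
```

For example, a hypothesis that drops one of five reference characters yields a CER of 0.20, matching how the percentage figures above (e.g. 15.26% to 13.22%) are obtained over a whole test set.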
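The mixup augmentation mentioned in the triphone experiment (CER 17.83% to 17.34%) follows the general recipe of interpolating pairs of training examples and their labels with a Beta-distributed weight. The thesis's exact setup is not given here; this is a hypothetical illustration of the general technique, with `alpha` and all names chosen for the example:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Blend two training examples (features x, label vectors y) with a
    single Beta(alpha, alpha) weight, as in mixup data augmentation."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # mixing coefficient in (0, 1)
    x = lam * x1 + (1.0 - lam) * x2       # convex combination of features
    y = lam * y1 + (1.0 - lam) * y2       # same combination of soft labels
    return x, y
```

The resulting features and soft labels share one mixing coefficient, so the label stays a valid probability distribution whenever the inputs are.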
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/87902 |
DOI: | 10.6342/NTU202300727 |
Full-Text Authorization: | Authorized (worldwide public access) |
Appears in Collections: | Department of Computer Science and Information Engineering |
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-111-2.pdf | 1.78 MB | Adobe PDF | View/Open |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated in their license terms.