Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88903
Title: | 類神經網路聲碼器應用於語音增強上之可行性探索 Exploring Feasibility of Using Neural Vocoders for Speech Enhancement |
Author: | 楊舒涵 Shu-Han Yang |
Advisor: | 李琳山 Lin-Shan Lee |
Keywords: | speech enhancement, neural vocoder, generative model, deep learning, unsupervised learning, speech synthesis |
Publication Year: | 2023 |
Degree: | Master |
Abstract: | Speech enhancement has been a long-standing research topic. While state-of-the-art end-to-end techniques can solve many traditional problems, they are constrained by the need for paired training data, so issues remain on the way toward universal speech enhancement. With the surge in popularity of ChatGPT, generative AI techniques offer a reasonable expectation for achieving generalized speech enhancement.
Neural vocoders are speech synthesis models that have seen breakthrough progress in recent years. Trained on abundant clean speech samples, they can convert low-dimensional acoustic features back into waveforms, and many high-quality pre-trained neural vocoders have been released online. This thesis investigates several representative pre-trained vocoders, WaveNet, WaveRNN, and WaveGlow, to explore whether vocoders trained on large amounts of clean speech have learned the characteristics of clean audio and can therefore be used for speech enhancement, and to analyze their effectiveness.
The experimental results show that WaveNet provides no speech enhancement effect under any evaluation metric, and its generation speed is extremely slow, making it unsuitable for speech enhancement. WaveRNN and WaveGlow significantly reduce the mel-cepstral distortion (MCD) between the MFCC vectors of noisy mixtures and those of the clean audio, with WaveRNN being less dependent on its training set. Specifically, the vocoders perform better when the input speech has a low signal-to-noise ratio (SNR) or is mixed with white noise, perform worse with speech-like noise, and are largely unaffected by speaker variability.
Comprehensive analysis indicates that pre-trained vocoders are less suitable for speech enhancement applied directly at a system's back end; for speech enhancement during data preprocessing, the choice between WaveRNN and WaveGlow depends on the dataset source and the required generation speed. Moreover, the effectiveness of speech enhancement is evidently influenced by many factors, including the vocoder architecture itself. The experimental results suggest that the glow architecture of WaveGlow holds promise for designing generalized unsupervised speech enhancement models, and that a more diverse multi-speaker training set would likely yield better results. |
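The abstract evaluates vocoders by mel-cepstral distortion (MCD) between MFCC vectors of enhanced and clean speech. As a rough illustration of how that metric is typically computed, here is a minimal sketch using only NumPy; it assumes frame-aligned MFCC matrices of shape (frames, coefficients) with the energy coefficient already excluded, and uses the conventional 10·√2/ln 10 scaling. The function name `mcd` and the synthetic test data are illustrative, not from the thesis.

```python
import numpy as np

def mcd(mfcc_ref: np.ndarray, mfcc_test: np.ndarray) -> float:
    """Average mel-cepstral distortion (dB) between two aligned MFCC matrices.

    Assumes both inputs have shape (frames, coefficients) and are frame-aligned.
    """
    diff = mfcc_ref - mfcc_test                  # per-frame coefficient differences
    dist = np.sqrt((diff ** 2).sum(axis=1))      # Euclidean distance per frame
    const = 10.0 * np.sqrt(2.0) / np.log(10.0)   # ~6.14, standard MCD scaling factor
    return float(const * dist.mean())            # mean distortion over all frames

# Illustrative synthetic features: identical features give zero distortion,
# while an offset (standing in for noise) yields a positive MCD.
clean = np.random.default_rng(0).normal(size=(100, 12))
noisy = clean + 0.5
print(mcd(clean, clean))   # → 0.0
print(mcd(clean, noisy))   # positive, larger for noisier features
```

A smaller MCD between enhanced and clean MFCCs indicates the vocoder output is spectrally closer to clean speech, which is the sense in which WaveRNN and WaveGlow "close the gap" above.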
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88903 |
DOI: | 10.6342/NTU202301620 |
Full-Text Availability: | Authorized (restricted to on-campus access) |
Appears in Collections: | Graduate Institute of Communication Engineering |
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-111-2.pdf Restricted to NTU campus IP addresses (use the VPN service for off-campus access) | 3.71 MB | Adobe PDF | View/Open |
Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.