Improving Generalization Over Unseen Noise for Speech Enhancement with Ensemble of Multi-branched Regressions

Cheng Yu; 于晟

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74263

Title:	Improving Generalization Over Unseen Noise for Speech Enhancement with Ensemble of Multi-branched Regressions (original title: 增進語音增強於未知噪聲泛用性-運用整合性多屬性回歸模型)
Author:	Cheng Yu (于晟)
Advisor:	Shao-Yi Chien (簡韶逸)
Keywords:	Speech Enhancement, Neural Networks, Generalizability, Dynamic-Sized Decision Tree, Ensemble of Multi-branched Regression
Publication Year:	2019
Degree:	Master's
Abstract:	Speech enhancement (SE) algorithms aim to remove undesired components, typically noise recorded along with the speech, from a noise-contaminated speech signal. In recent years, deep-learning-based SE algorithms have substantially improved denoising performance. Despite these gains over earlier SE algorithms, two problems remain unsolved, and both concern how well an algorithm generalizes to real-world noise scenarios. (1) Mismatched noise conditions: performance becomes sub-optimal when a model is tested on unseen noise types that could not be included in the training corpus. Because real-world noise types are too numerous to collect exhaustively, a model cannot be prepared for every one of them in advance. (2) Loss of focus on specific noise types: a model trained with multiple noise types cannot sustain optimal enhancement on some of them, even though those noises are part of the training set. This happens because, when many noise conditions are paired with clean speech for training, the model cannot optimize for all of them at once; it tends to converge toward the noise conditions whose numerical characteristics are most pronounced and neglects the others. In this study, we propose a novel ensemble of multi-branched regression (EMR) model to address both problems. The EMR model consists of two stages, an offline (training) stage and an online (application) stage. In the offline stage, a dynamic-sized decision tree (DSDT) is built to guide the construction of the ensemble networks. The DSDT is built from speech attributes derived from prior knowledge of speech and noise conditions, including speaker and environment factors. Groups of SE networks then learn particular mapping functions from noisy to clean speech according to the branches of the DSDT. Finally, a fusion model is trained on the outputs of these mappings. In the online stage, noisy speech is first fed to the ensemble networks, and the fusion model then integrates their multiple outputs to produce the enhanced speech spectrum. Experimental results show that EMR improves over baseline models not only in objective evaluation metrics but also in intelligibility as measured by subjective human listening tests.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74263
DOI:	10.6342/NTU201903110
Full-Text License:	Paid license
Appears in Collections:	Graduate Institute of Electronics Engineering
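
The two-stage structure described in the abstract (branch-specific regressors trained offline, then fused online) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: least-squares linear maps stand in for the SE neural networks, a fixed array of hypothetical branch labels (`branch_ids`) stands in for the branches of the DSDT, and random vectors stand in for speech spectra. None of the function names come from the thesis, and this is not the author's implementation.

```python
import numpy as np


def fit_linear(X, Y):
    """Least-squares affine map X -> Y (hypothetical stand-in for one SE network)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
    return W


def apply_linear(W, X):
    """Apply an affine map produced by fit_linear."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ W


def train_emr(noisy, clean, branch_ids):
    """Offline stage: one regressor per branch, then a fusion regressor."""
    branches = sorted(set(branch_ids.tolist()))
    # Each regressor only sees the noisy/clean pairs that belong to its branch.
    experts = [
        fit_linear(noisy[branch_ids == b], clean[branch_ids == b])
        for b in branches
    ]
    # The fusion model is trained on the concatenated outputs of all regressors.
    stacked = np.hstack([apply_linear(W, noisy) for W in experts])
    fusion = fit_linear(stacked, clean)
    return experts, fusion


def enhance(experts, fusion, noisy):
    """Online stage: run every branch regressor, then fuse their outputs."""
    stacked = np.hstack([apply_linear(W, noisy) for W in experts])
    return apply_linear(fusion, stacked)


# Synthetic demonstration data (assumed, not from the thesis).
rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 8))          # 200 frames, 8 feature bins
branch_ids = np.repeat([0, 1], 100)        # two hypothetical attribute branches
offsets = np.where(branch_ids[:, None] == 0, 0.5, -1.0)
scales = np.where(branch_ids[:, None] == 0, 0.3, 1.0)
noisy = clean + offsets + scales * rng.normal(size=clean.shape)

experts, fusion = train_emr(noisy, clean, branch_ids)
enhanced = enhance(experts, fusion, noisy)

mse_noisy = float(np.mean((noisy - clean) ** 2))
mse_enhanced = float(np.mean((enhanced - clean) ** 2))
print(f"MSE before: {mse_noisy:.3f}, after: {mse_enhanced:.3f}")
```

Note that no branch label is needed at inference time: as in the online stage described above, the input goes through every regressor and only the fusion step decides how to combine them. With purely linear stand-ins the fused output is itself a single affine map of the input, so the sketch shows the data flow only; any benefit from the ensemble depends on nonlinear networks such as those the thesis uses.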

Files in This Item:

File	Size	Format
ntu-108-1.pdf (currently not authorized for public access)	2.95 MB	Adobe PDF


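Unless their copyright terms are specifically stated otherwise, all items in this system are protected by copyright, with all rights reserved.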
