Deep Learning-based Speech Assessment Metrics and its Applications

李安德; Ryandhimas Edo Zezario

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91293

Title:	Deep Learning-based Speech Assessment Metrics and its Applications (以深度學習為基礎的語音評估標準與應用)
Author:	李安德 Ryandhimas Edo Zezario
Advisor:	傅楸善 Chiou-Shann Fuh
Co-advisor:	曹昱 Yu Tsao
Keywords:	non-intrusive speech assessment models, deep learning, multi-objective learning, speech enhancement
Publication Year:	2023
Degree:	Doctoral
Abstract:	Most conventional speech assessment metrics require a clean reference signal to compute the evaluation score. This requirement limits their use in real-world scenarios, where a clean reference is not always available. To address this limitation, non-intrusive speech assessment metrics have attracted considerable attention in recent years. With the emergence of deep learning models and the availability of training data, many studies have used deep learning to build non-intrusive speech assessment models. However, despite the good performance of these models, generalization remains a challenge. This thesis therefore proposes several approaches to improve the prediction capability of deep learning-based non-intrusive speech assessment models.
In the first approach, we investigate which model architecture yields more accurate speech intelligibility prediction. Experimental results confirm that a convolutional neural network and bidirectional long short-term memory (CNN-BLSTM) architecture with a multiplicative attention mechanism achieves higher prediction scores than CNN, DNN, and BLSTM models. In the second approach, we assume that richer acoustic features help the model learn more useful information; we therefore introduce cross-domain features that combine spectral and time-domain features with embedding features from a self-supervised learning (SSL) model. In addition, rather than predicting only one type of assessment metric, we propose a multi-task learning model that serves as a deep learning-based multi-objective speech assessment model. Experimental results confirm that cross-domain features provide richer acoustic information than single-type features, and that multi-task learning can improve the prediction capability of the speech assessment model. Because sufficient training data is not always available, a method that achieves good prediction performance with limited training data is beneficial. With this in mind, we propose a knowledge transfer strategy in which the weights of the student model are initialized from a teacher model. We also study which assessment metrics should be used in multi-task learning. Experimental results confirm that the knowledge transfer strategy improves prediction performance, and that selecting more closely related assessment metrics for multi-task learning improves it further.
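The knowledge transfer step described above, initializing a student model from a teacher's weights, can be sketched as follows. This is a minimal illustration and not the thesis implementation: the toy models are plain dictionaries of weight lists, the layer names (`encoder`, `attention`, `quality_head`, `intelligibility_head`) are hypothetical, and copying only layers whose name and size match is an assumption about how such a transfer might be restricted.

```python
import copy
import random


def init_model(layer_sizes, seed):
    """Create a toy model: a dict mapping layer name -> flat list of weights."""
    rng = random.Random(seed)
    return {
        name: [rng.uniform(-0.1, 0.1) for _ in range(size)]
        for name, size in layer_sizes.items()
    }


def transfer_weights(teacher, student):
    """Return a copy of `student` whose layers are overwritten by the teacher's
    weights wherever the layer name and size match (assumed matching rule).
    Layers without a match keep the student's own initialization."""
    initialized = copy.deepcopy(student)
    transferred = []
    for name, weights in teacher.items():
        if name in initialized and len(initialized[name]) == len(weights):
            initialized[name] = list(weights)  # copy, do not share the list
            transferred.append(name)
    return initialized, transferred


# Hypothetical teacher trained on one assessment metric, and a student
# that targets a different metric with its own output head.
teacher = init_model({"encoder": 8, "attention": 4, "quality_head": 2}, seed=0)
student = init_model({"encoder": 8, "attention": 4, "intelligibility_head": 3}, seed=1)

student_init, transferred = transfer_weights(teacher, student)
print("layers initialized from teacher:", transferred)
```

After this step the student would be trained on its own (possibly small) dataset, starting from the transferred weights instead of a random initialization.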
To further improve the prediction performance of the deep learning-based speech assessment model, we introduce an improved version of the proposed cross-domain features that leverages a weakly supervised model, namely Whisper. Compared with the original cross-domain features, the updated combination achieves higher prediction performance, indicating the potential of a weakly supervised model for providing robust acoustic features. Furthermore, we propose a novel method based on a multi-branched model and cross-domain features that handles binaural acoustic features and serves as a speech intelligibility prediction model for hearing aids. This approach achieves competitive prediction performance compared with other methods. Finally, we integrate speech assessment metrics directly with speech enhancement (SE) by proposing the zero-shot model selection (ZMOS) and quality-intelligibility (QI)-aware SE (QIA-SE) approaches. Experimental results confirm that both methods achieve notable enhancement performance, with QIA-SE outperforming the ZMOS systems and two additional baseline systems.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91293
DOI:	10.6342/NTU202304242
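Full-text Authorization:	Authorized (open access worldwide)
Appears in Collections:	Department of Computer Science and Information Engineering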

Files in This Item:

File	Size	Format
ntu-112-1.pdf	4.83 MB	Adobe PDF

Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are otherwise indicated.