Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/86964
Title: | 基於自監督式語音表徵之無參考客觀語音品質評估 No-reference Objective Speech Quality Assessment Based on Self-supervised Speech Representations |
Author: | 曾韋誠 Wei-Cheng Tseng |
Advisor: | 李琳山 Lin-shan Lee |
Keywords: | Deep Learning, Speech Processing, Self-supervised Speech Representation, Speech Quality Assessment, No-reference Objective Speech Quality Assessment |
Year of publication: | 2022 |
Degree: | Master's |
Abstract: | Speech quality assessment, the evaluation of audio quality, has for decades been an essential part of speech processing for measuring system performance. Conventionally, the mean opinion score (MOS), obtained by averaging the subjective ratings of many listeners, has been considered the gold standard for speech quality assessment, but collecting it requires listening tests with a large number of human listeners, making it costly and time-consuming. Full-reference objective speech quality assessment approaches, developed by simulating the human auditory system, are widely used and have been shown to correlate highly with MOS. However, these approaches require a clean reference signal for comparison with the test signal, so they cannot be used when such a signal is unavailable. Accordingly, this thesis focuses on developing a no-reference objective speech quality assessment method that requires no reference signal and correlates well with MOS.
Meanwhile, self-supervised pre-trained models, which exploit large-scale unlabeled speech corpora, have matured in speech processing. These models extract high-level, informative, and compact representation vectors from raw audio, and the extracted representations have been shown to benefit downstream tasks such as speech recognition, speaker verification, speech translation, and spoken language understanding. Nonetheless, their potential for no-reference speech quality assessment has yet to be fully explored. In this thesis, we first conduct a preliminary analysis of the feasibility of adopting self-supervised speech representations for speech quality assessment. The results show that these representations contain rich acoustic and linguistic information and can distinguish audio signals of different qualities, suggesting that they are well suited to evaluating speech quality. Based on these results, we propose a novel, deep no-reference objective speech quality assessment model based on HuBERT representations. Experimental results show that our model significantly outperforms previous state-of-the-art approaches based on conventional speech representations and generalizes better across languages. Finally, we conduct several probing analyses to better understand the factors that affect model performance. |
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/86964 |
DOI: | 10.6342/NTU202300029 |
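Full-text license: | Authorized (open access worldwide) |
Appears in collections: | Graduate Institute of Communication Engineering |
Files in this item:
File | Size | Format | |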
---|---|---|---|
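ntu-111-1.pdf | 2.45 MB | Adobe PDF | View/Open |
Unless their copyright terms are specifically stated otherwise, all items in the system are protected by copyright, with all rights reserved.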