Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91729
Title: Speech Processing with Higher Efficiency: Self-Supervised Learning for Low-Resource Scenarios (更高效的語音處理:低資源情境下的自監督式學習)
Author: Ting-Wei Liu (劉廷緯)
Advisor: Hung-Yi Lee (李宏毅)
Keywords: speech processing, self-supervised learning, low-resource, pre-training, representation learning
Publication Year: 2024
Degree: Doctoral
Abstract:

In the rapidly evolving digital speech processing (DSP) domain, supervised learning, driven by advanced deep neural networks (DNNs) and a wealth of task-specific labeled data, has shown significant progress. However, the need for large labeled datasets and the overhead of training DNNs from scratch for every application render this approach resource-intensive and particularly limiting in low-resource scenarios. In contrast, humans are naturally efficient at self-learning from vast amounts of unlabeled data. For instance, a child can learn the basics of a language, grasping syntax and vocabulary, through mere exposure and interaction with little explicit instruction, whereas machines may require thousands of hours of labeled data to achieve proficiency in speech processing tasks, and even then may not fully grasp the nuances and contextual variations inherent in human speech. Humans exhibit a remarkable ability to effortlessly encode experiences into memories or common sense, efficiently recalling and reapplying learned background knowledge when faced with new tasks. As a result, with minimal explicit instruction, humans can recognize previously unseen objects, master unfamiliar skills, or learn new languages; this capability allows us to acquire new knowledge and skills very efficiently.

Drawing inspiration from human learning, this thesis pivots from the label-intensive supervised learning paradigm towards self-supervised learning (SSL) algorithms, also referred to in the literature as self-supervised representation learning (SSRL). We aim to leverage unlabeled speech data to circumvent the challenges associated with the conventional supervised learning scheme. In particular, we focus on developing and studying the pre-training of SSRL models from large amounts of unlabeled speech data. The core aim is to develop versatile, reusable, and efficient representations that enhance different DSP tasks, especially in low-resource and even zero-resource scenarios.

In the first part of the thesis, we demonstrate learning zero-resource tasks with no downstream supervision. We first discover discrete linguistic units from speech, without using any labels, in an autoencoder reconstruction setting. We find that the proposed representation automatically separates speech content from speaker style and is sufficient to cover the linguistic content of a given language. We can therefore perform unsupervised voice conversion (VC) for low-resource languages with zero labels. In the ZeroSpeech 2019 Challenge, we achieved outstanding representation performance at a very low bitrate.
In the second part of the thesis, we show how an SSRL model allows us to achieve better speech processing performance with less labeled data. Previous speech representation methods learn by conditioning on past frames and predicting information about future frames. In contrast, our method encodes the current frame by jointly conditioning on past and future contexts through a single auxiliary task: masked reconstruction along the time axis. Experimental results show that our representation improves phoneme classification performance and outperforms other approaches. In a low-resource setting, with minimal fine-tuning and a fraction of the labeled data (0.1%), we remarkably surpass conventional methods that rely on fully labeled datasets (100%).

In the third part of the thesis, we demonstrate learning one single model that can be transferred to a wide range of speech tasks. We present an SSRL model that learns by applying reconstruction loss along three orthogonal axes: time, frequency, and magnitude. This allows the model to capture the rich information in speech, aspiring to a versatile model across varied tasks and domains, whereas previous work often learns from a single auxiliary task such as contrastive prediction, autoregressive prediction, or masked reconstruction. Experimental results show that the proposed representation benefits several downstream tasks, including phoneme classification, keyword spotting, speaker recognition, and speech recognition. We achieve robust performance, improving upon surface features and outperforming previous models. We show that our speech representations are transferable not only across downstream tasks but also to datasets unseen during pre-training, increasing reusability and efficiency.
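As a rough illustration of the pretext tasks in the second and third parts, the sketch below masks contiguous spans of input frames and trains a bidirectional encoder to rebuild them, then extends the corruption to the frequency and magnitude axes. The tiny Transformer, span sizes, and noise scale are illustrative assumptions, not the thesis implementation.

```python
# Hedged sketch: masked reconstruction on the time axis (second part),
# extended with frequency- and magnitude-axis alterations (third part).
import torch
import torch.nn as nn
import torch.nn.functional as F

def alter(mel, t_span=7, f_span=8, noise_std=0.2):
    """Corrupt a mel-spectrogram along time, frequency, and magnitude."""
    x = mel.clone()
    B, T, D = x.shape
    mask = torch.zeros(B, T, dtype=torch.bool)
    for b in range(B):
        t0 = torch.randint(0, T - t_span, (1,)).item()
        x[b, t0:t0 + t_span, :] = 0.0           # time axis: hide frame spans
        mask[b, t0:t0 + t_span] = True
        f0 = torch.randint(0, D - f_span, (1,)).item()
        x[b, :, f0:f0 + f_span] = 0.0           # frequency axis: hide channel bands
    x = x + noise_std * torch.randn_like(x)      # magnitude axis: Gaussian noise
    return x, mask

# A bidirectional encoder sees both past and future context around each gap.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=80, nhead=8, batch_first=True),
    num_layers=3,
)
head = nn.Linear(80, 80)                         # predicts the clean frames

mel = torch.randn(4, 200, 80)
corrupted, mask = alter(mel)
pred = head(encoder(corrupted))
# The loss on masked frames forces joint conditioning on past and future
# context; reconstructing the noisy, band-dropped input covers the other axes.
loss = F.l1_loss(pred[mask], mel[mask]) + F.l1_loss(pred, mel)
```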
In the fourth part, we study the dynamics of the SSRL pre-training process, investigate model design choices, and benchmark our models on the recognized SUPERB challenge. We first investigate the effect of pre-training on different amounts of data and on various input features. Analyzing different model sizes, we find that smaller models are stronger representation learners than larger models, while larger models are more effective for downstream fine-tuning. We evaluate our models with the SUPERB benchmarking protocol and again demonstrate the feasibility of adopting one pre-trained model for many speech tasks.

In the last part of the thesis, we investigate the underlying factors contributing to the success of SSRL models in speech processing. We first demonstrate that a well-designed SSRL pretext task benefits downstream performance. Interestingly, slimmer models surpass conventional small models under a constrained parameter budget. With limited computational resources, enlarging the model size yields better performance gains than increasing the data size. Moreover, given a fixed model size and computing budget, the size of the unlabeled dataset remains vital, as performance suffers when iterating over a small data pool. Additionally, given a fixed training budget, we observe a valley curve in loss and performance as a function of model size, indicating an optimal model size for a specified compute budget. Finally, under a limited computational budget, we pre-train TERA with a new architectural design and optimal model size, achieving superior performance compared to the HuBERT and wav2vec 2.0 models under comparable settings.

Our findings shed light on the delicate balance between model and data size, offering insights for effectively training SSRL models in resource-constrained scenarios. Ultimately, our research illuminates the complex dynamics of training speech SSRL models and offers valuable guidance for future work on the next generation of SSRL models under resource constraints.
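The valley-curve finding in the last part can be pictured with a toy fixed-compute sweep: holding total compute roughly constant, larger models train for fewer steps, and some intermediate size minimizes the loss. The function train_and_eval below is a hypothetical placeholder for a full pre-training run, not an API from the thesis.

```python
# Toy sketch of a fixed-compute sweep over model sizes. With the budget
# approximated as params * steps, bigger models get fewer steps; the
# reported valley curve means some intermediate size wins.
def best_model_size(budget: int, sizes: list[int], train_and_eval) -> int:
    losses = {}
    for params in sizes:
        steps = budget // params         # trade model size against training steps
        losses[params] = train_and_eval(params, steps)
    return min(losses, key=losses.get)   # size at the bottom of the valley
```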
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91729
DOI: 10.6342/NTU202400216
Full-text authorization: Consent granted (open access worldwide)
Appears in Collections: Department of Computer Science and Information Engineering
Files in This Item:
File | Size | Format
---|---|---
ntu-112-1.pdf | 13.13 MB | Adobe PDF
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.