Please use this Handle URI to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97096

| Title: | 邁向通用語音模型:提示語音語言模型於多樣語音處理任務 (Towards a Universal Speech Model: Prompting Speech Language Models for Diverse Speech Processing Tasks) |
| Author: | 張凱爲 (Kai-Wei Chang) |
| Advisor: | 李宏毅 (Hung-yi Lee) |
| Keywords: | Speech Language Model, Large Language Model, Prompting, Self-supervised Learning, Parameter-efficient Learning |
| Publication Year: | 2025 |
| Degree: | Doctoral |
| Abstract: | The "pre-train, fine-tune" paradigm has long been the dominant approach in speech processing: expert models are designed and pre-trained speech representation models are fine-tuned for each downstream task. While effective, this approach scales poorly as the number of downstream tasks grows, demanding considerable human effort, storage capacity, and computational resources. To address these limitations, this thesis proposes an efficient and universal speech processing framework capable of addressing diverse tasks in a unified manner.
Inspired by the success of the prompting paradigm in Natural Language Processing (NLP) and by recent advances in Speech Language Models (Speech LMs) trained on quantized discrete speech units, this thesis pioneers the application of prompting to speech processing. To this end, we introduce SpeechPrompt, a unified prompting framework designed for Speech LMs to handle a broad spectrum of tasks. Prompting steers a language model solely by modifying its input, leveraging the model's generative capabilities to tackle diverse downstream tasks without extensive model redesign, which significantly reduces computational demands and human effort. By reformulating speech processing tasks as speech-to-unit generation, this research demonstrates the seamless integration of speech classification, sequence generation, and speech generation tasks within the SpeechPrompt framework, offering a scalable, efficient, and unified solution for the speech processing field (a minimal illustrative sketch of this setup appears after the record table below). Experimental results show that, with a similar number of trainable parameters, prompting achieves performance competitive with fine-tuning approaches based on self-supervised learning models, and it outperforms fine-tuning in few-shot learning scenarios.
This thesis also investigates the two major language model architectures within the prompting paradigm: encoder-decoder LMs and decoder-only LMs. Our findings reveal that encoder-decoder speech LMs significantly outperform decoder-only models within the prompting framework, contrasting with the mainstream practice in NLP, which predominantly focuses on developing decoder-only LMs; this result can serve as a reference for the future development of speech LMs.
Furthermore, this thesis explores another key prompting technique, In-Context Learning (ICL), which enables speech LMs to perform new tasks without any additional training by learning patterns from examples provided in the input. Experimental results indicate that in-context learning achieves performance competitive with simple supervised baselines (a second sketch at the end of this page illustrates the mechanics). These findings validate the feasibility of applying ICL to speech processing and highlight its potential to reduce computational costs and improve adaptability across diverse tasks, laying the groundwork for scalable and efficient speech model development.
In summary, this thesis presents the first systematic study of the prompting framework for handling diverse speech tasks with speech language models. It examines the feasibility of the prompting paradigm for speech processing and provides valuable insights for the future development of speech LMs. The proposed prompting framework demonstrates significant research potential and practical value, laying a solid foundation for further advances in speech processing technology. |
| URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97096 |
| DOI: | 10.6342/NTU202500479 |
| Full-Text Permission: | Authorized (open access worldwide) |
| Full-Text Release Date: | 2025-02-28 |
| Appears in Collections: | Graduate Institute of Communication Engineering |
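
The abstract describes prompting as steering a frozen speech LM purely through its input: trainable prompt vectors are prepended to a sequence of discrete speech units, and the model's generated output is read out as the task prediction. Below is a minimal, self-contained PyTorch sketch of that idea. The tiny Transformer stand-in, vocabulary size, prompt length, and readout convention are all assumptions for illustration, not the thesis's actual SpeechPrompt implementation.

```python
# Illustrative sketch only: input-side prompting of a frozen "speech LM" that
# operates on discrete speech units. All sizes/modules are assumed stand-ins.
import torch
import torch.nn as nn

VOCAB = 104      # e.g. 100 discrete speech units + a few label/special tokens (assumed)
DIM = 256        # hidden size (assumed)
N_PROMPT = 10    # number of trainable prompt vectors (assumed)

class PromptedSpeechLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.lm = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for a pre-trained speech LM
        self.head = nn.Linear(DIM, VOCAB)
        for p in self.parameters():          # freeze the "pre-trained" model...
            p.requires_grad = False
        self.prompt = nn.Parameter(torch.randn(N_PROMPT, DIM) * 0.02)  # ...train only the prompt

    def forward(self, units):                # units: (B, T) discrete speech units
        x = self.embed(units)                # (B, T, D)
        p = self.prompt.unsqueeze(0).expand(x.size(0), -1, -1)
        x = torch.cat([p, x], dim=1)         # prepend the prompt: (B, N_PROMPT + T, D)
        return self.head(self.lm(x))         # per-position logits over units/label tokens

model = PromptedSpeechLM()
units = torch.randint(0, 100, (2, 50))       # two toy utterances as unit sequences
print(model(units).shape)                    # torch.Size([2, 60, 104])
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("trainable parameters:", trainable)    # 2560: only the prompt vectors
```

Because only the prompt vectors receive gradients, the trainable-parameter count stays small, which is the quantity the abstract compares against fine-tuning; a classification task would read the prediction at a designated position, while sequence and speech generation tasks would decode the whole output unit sequence.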
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-113-1.pdf | 8.08 MB | Adobe PDF |
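The abstract's second technique, in-context learning, needs no parameter updates at all: demonstration pairs (unit sequence, label) are concatenated in front of the query, and the frozen model's next-token prediction is taken as the answer. The sketch below shows only these mechanics under the same toy assumptions (an untrained stand-in model, hypothetical token ids for labels and separators); it is not the thesis's experimental setup.

```python
# Illustrative sketch only: in-context learning with a frozen toy "speech LM".
import torch
import torch.nn as nn

VOCAB, DIM = 106, 128                  # 100 units + label/separator tokens (assumed)
SEP, LABEL_A, LABEL_B = 100, 101, 102  # hypothetical special-token ids

class TinySpeechLM(nn.Module):
    """Stand-in for a pre-trained speech LM (untrained; mechanics only)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    @torch.no_grad()                   # inference only: ICL performs no updates
    def next_token_logits(self, ids):  # ids: (1, T)
        h = self.body(self.embed(ids))
        return self.head(h[:, -1])     # logits for the next token

def icl_context(examples, query):
    """Build [units SEP label SEP ... units SEP]; the next token should be the query's label."""
    ids = []
    for units, label in examples:
        ids += units + [SEP, label, SEP]
    ids += query + [SEP]
    return torch.tensor([ids])

lm = TinySpeechLM().eval()
demos = [([3, 17, 42], LABEL_A), ([55, 8, 91], LABEL_B)]  # toy (units, label) pairs
logits = lm.next_token_logits(icl_context(demos, query=[3, 18, 40]))
pred = max((LABEL_A, LABEL_B), key=lambda t: logits[0, t].item())
print("predicted label token:", pred)  # arbitrary here (untrained); shows the I/O format
```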
