NTU Theses and Dissertations Repository
College of Electrical Engineering and Computer Science / Graduate Institute of Communication Engineering
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93286
Title: Efficient Speech Generation: Computational Efficiency, Data Efficiency, and Its Application in Speech Self-Supervised Learning (高效率語音生成:運算效率、資料效率及其在語音自監督學習中的應用)
Author: Po-chun Hsu (許博竣)
Advisor: Lin-shan Lee (李琳山)
Co-Advisor: Hung-yi Lee (李宏毅)
Keywords: Speech Generation, Computational Efficiency, Data Efficiency
Publication Year: 2024
Degree: Ph.D.
Abstract:
Speech generation models have achieved outstanding performance with the advancement of deep learning in recent years. Despite these remarkable achievements, the development of speech generation technology has been accompanied by greater demands on computational and data resources, limiting its efficiency. This thesis aims to address the challenges of efficient speech generation from several aspects: computational efficiency, data efficiency, and its application to data-efficient speech self-supervised learning (SSL).
We first focus on the computational efficiency of speech generation. We propose a highly compressed non-autoregressive neural vocoder, significantly reducing model size and computational resources for training. By integrating the improved architecture with an additional post-filter, the proposed model achieves high-quality speech output with real-time inference capabilities without relying on GPU acceleration. This model not only demonstrates superior performance in generating 44 kHz speech but also sets a new benchmark for efficient speech synthesis.
Next, we explore autoregressive generation mechanisms and enhance inference efficiency. We introduce innovative methods, Frequency-wise Autoregressive Generation (FAR) and Bit-wise Autoregressive Generation (BAR), which perform the autoregressive processes in different domains. These methods drastically improve inference speed while maintaining high speech quality. Besides neural vocoders, the proposed techniques are versatile and have the potential to be applied to other speech generation tasks, including autoregressive models for efficient inference and non-autoregressive models for better quality, thereby broadening their impact.
We then shift focus to data efficiency, addressing the high cost of collecting labeled data for text-guided voice conversion. We introduce reinforcement learning (RL) and reinforcement learning from human feedback (RLHF) to enhance the expressiveness of generated speech. Our approach reduces dependency on large labeled datasets and improves the model's ability to handle text descriptions of complex speech styles and to generate expressive speech, achieving significant improvements in both objective and subjective evaluations.
Lastly, we extend the scope of our research to improve data efficiency in speech SSL with speech generation techniques. By leveraging synthetic speech data generated from a high-quality text-to-speech system, we augment the low-resource pre-training corpus, reducing the need for extensive real-world speech data. The proposed approach demonstrates that synthetic data can effectively supplement real data, enabling competitive performance with significantly fewer resources.
Overall, this thesis makes substantial contributions to enhancing the efficiency of speech generation and its applications in speech processing. We introduce novel architectures, generation methods, and learning paradigms that address computational and data efficiency challenges, setting the stage for future advancements in the field.
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93286
DOI: 10.6342/NTU202401934
Full-Text Authorization: Granted (open access worldwide)
Appears in Collections: Graduate Institute of Communication Engineering

Files in This Item:
File: ntu-112-2.pdf (2.6 MB, Adobe PDF)


All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.

Contact Information
No.1 Sec.4, Roosevelt Rd., Taipei, Taiwan, R.O.C. 106
Tel: (02)33662353
Email: ntuetds@ntu.edu.tw
© NTU Library All Rights Reserved